Modern commodity operating systems, running on commodity hardware, are frequently
used  to  store  cryptographic  keys  and/or  to  perform  cryptographic  functions  such  as  digital 
signatures.  The  importance  of  their  security  can  hardly  be  overestimated  because  of  the 
following: Digital signatures can not only be used for binding agreements and authenticating
Web sites, but are also used for code authentication, including authenticating software
updates,  such  as  the  widely-used  Microsoft  Windows  Automatic  Update.  Cryptographic  keys 
are  used  to  encrypt  sensitive  personal  data  stored  on  commodity  operating  systems. 

While  security  of  cryptographic  primitives  and  protocols  has  been  well-understood  in 
abstract models, there is relatively little understanding and study of the security of cryptography
on real commodity systems. Furthermore, while one could exploit special hardware to
ensure security of cryptographic keys, it is even more difficult to protect cryptographic functions
because an attacker can compromise a cryptographic function by compromising any of
many  different  points  in  the  invocation  process,  including  libraries  and  the  operating  system. 
We  examine  the  problem  of  protecting  cryptographic  keys  and  cryptographic  functions  on 
commodity  hardware  and  operating  systems,  with  a  focus  on  combating  attacks  committed 
by  software,  primarily  malware.  Specifically,  we  make  two  significant  technical  contributions: 
1. We demonstrate a technique for performing encryption without having the cryptographic
key in memory, thereby alleviating RAM disclosure attacks against keys. 2. We create a
system for protecting both cryptographic keys and digital signatures from being disclosed
or  abused  (respectively)  by  malware,  while  allowing  security  properties  of  the  signatures  to 
be  verified  offline  by  remote  parties.  As  such,  this  thesis  moves  a  significant  step  towards 
bridging  the  gap  between  security  properties  of  cryptosystems  in  abstract  models  and  the 
needs  of  security  assurance  in  real-life  systems.  Our  results  are  also  generally  applicable  to 
maintaining  confidentiality  and  security  of  non-cryptographic  secrets  and  functions. 
Chapter 1

INTRODUCTION 


1.1  Motivation 

Computer  security  has  made  impressive  theoretical  progress,  such  as  proofs  of  security  for 
cryptographic protocols. However, the security of systems in practice depends on the security
of cryptography in practice, because security properties such as authentication and
confidentiality rely on the services of cryptography. This causes a “chicken-and-egg” problem,
because, in turn, the security of cryptography in practice depends on the security of
the entire system. In particular, the security of the cryptographic keys and cryptographic
functions in real systems depends on the security of the software stack. This is an area that
has not been well understood. Worse, users of commodity hardware and operating system
software frequently find themselves besieged by spam, malware, and other security-related
issues, which can reduce system security and further erode the trustworthiness of cryptography
(and thus the trustworthiness of the security services) in practice.

The  threat  of  malware  that  discloses  keys  and  other  sensitive  data  is  real.  For  an  example 
of a simple malicious software attack disclosing keys, see [34]. Unfortunately, even specialized
hardware is no panacea for software attacks because (1) many legacy systems may not deploy
or support such devices, and (2) such devices have a cost, which is often substantial. More
importantly, although special hardware could secure the keys themselves, the attacker can
compromise a private signing function without compromising a private signing key [46]. This
would be particularly easy for malware because malware frequently runs using the same
account  privilege  as  the  user  being  attacked,  making  ordinary  operating  system  defense 
mechanisms  useless. 

In this dissertation we emphasize securing cryptographic keys and cryptographic functions
(especially digital signatures) against malware, while still allowing the user to run
ordinary  hardware  and  software.  In  particular,  we  generally  allow  the  user  to  run  existing 
applications  and  operating  systems  unmodified,  as  long  as  the  application  architectures  use 
a  cryptographic  library  with  a  well-designed  API.  The  overall  goal  of  this  work  may  be 
summarized  thus: 

To  protect  cryptographic  keys  and  functions  from  software  attacks,  particularly 
attacks  by  malware. 

1.2  Dissertation  Overview 

This dissertation will generally focus on securing cryptographic keys and cryptographic functions,
although our design and much of our implementation can be used to secure other kinds
of  sensitive  data  and  functions.  Specifically,  the  dissertation  presents  two  mechanisms,  each 
of  which  is  realized  by  a  software  component. 

1.2.1 Safekeeping Cryptographic Keys from Memory Disclosure Attacks

Chapter 2 presents and analyzes a technique for using a cryptographic key without having the
key in memory.1 This gives protection against memory disclosure attacks, which otherwise
can recover keys, such as in the case of Apache on Linux [34]. See Figure 1.1 for a simplified
depiction of the technique: basically, we are able to utilize certain registers that were not
designed for general or cryptographic use. As a specific example, a prototype is created that
modifies RSA private key signing in OpenSSL to use the technique.

1This chapter essentially corresponds to our publication [54].

[Figure 1.1 panels: (a) Before our solution: CPU registers; RAM (key). (b) With our solution: CPU registers (key); RAM.]

Figure 1.1: Before our solution, large keys such as RSA keys are in RAM while in use. With
our solution, keys can be contained in CPU registers.

The resulting system has the following features:

1. No special hardware is required; only resources found in typical CPUs are used.

2.  The  scheme  is  shown  to  leave  no  words  of  the  private  key  exponent  d  in  RAM. 

3. A RAM scrambling technique, which must be used to store the key in the single-CPU-core
case, is evaluated, showing that common attacks such as entropy scanning,
signature scanning, and content scanning are infeasible.

1.2.2  Assured  Digital  Signing 

Chapter 3 presents the Assured Digital Signature Service Provider, which is an example of
protecting cryptographic functions against malware attacks. It secures both the cryptographic
keys  used  for  signing  and  the  signing  function  itself,  even  in  the  presence  of  malware  running 
at  elevated  privilege  levels.  In  order  to  do  this,  it  uses  a  foundational  piece  that  we  believe 
might  be  of  independent  value,  called  the  protected  monitor.  This  provides  a  platform  on 
which  assured  services  can  be  built.  Figure  1.2  depicts  the  architecture  of  this  system. 
Further  details  will  be  explained  in  Chapter  3;  in  the  meantime  we  summarize  the  features 
of  the  resulting  system: 
1. Signature requests are validated using four criteria: (i) static measurement of boot and
kernel  (using  TPM);  (ii)  secured  crypto  library;  (iii)  authentication  of  the  requesting 
program  (measure  binary);  (iv)  trusted  path  user  confirmation  dialog. 

2.  Key  storage  services  are  secure  against  malware  and  even  raw  disk  access  from  within 
the  VM. 

3.  Signature  request  processing  is  likewise  secure  against  malware. 


Figure  1.2:  Architecture  of  Assured  Digital  Signature  Service  Provider  (Chapter  3) 


1.3  Combining  the  Pieces  from  the  Chapters 

The  reader  may  wish  to  understand  how  the  various  pieces  we  have  created  fit  together.  One 
way  to  understand  this  is  to  examine  the  key  locations  that  are  protected  by  the  protection 
pieces: 

• SSE Key-in-Register Cryptography (Chapter 2) protects the key from any RAM attack,
including disclosure of physical RAM (e.g., via a FireWire attack).

•  The  Assured  Digital  Signature  Service  Provider  (Chapter  3)  protects  keys  on  the  VM’s 
disk  as  well  as  in  the  VM’s  RAM.  (Of  course,  it  also  protects  the  signing  function  itself.) 
The two building blocks could be used together, as discussed in future work in Section 5.2.
As  such,  they  are  two  security  building  blocks  for  a  more  comprehensive  defense  framework 
that  we  suspect  remains  to  be  discovered. 
Chapter 2

SAFEKEEPING CRYPTOGRAPHIC KEYS FROM MEMORY DISCLOSURE ATTACKS

2.1  Introduction 

How  should  we  ensure  the  secrecy  of  cryptographic  keys  during  their  use  in  RAM?  This 
problem is important because it would be relatively easy for an attacker to gain unauthorized
access to (a portion of) RAM so as to compromise the cryptographic keys (in their
entirety)  appearing  in  it.  Two  example  attacks  that  have  been  successfully  experimented 
with  are  those  based  on  the  exploitation  of  certain  software  vulnerabilities  [34],  and  those 
based  on  the  exploitation  of  Direct  Memory  Access  (DMA)  devices  [57].  In  particular,  [34] 
showed that, in the Linux OS versions they experimented with, copies of a cryptographic key
tended to flood RAM, meaning that many copies of a key may appear in both allocated
and unallocated memory. This meant an attacker may only need to disclose a small portion
of  RAM  to  obtain  a  key.  As  a  first  step,  they  showed  how  to  ensure  only  one  copy  of  a  key 
appears  in  RAM.  Their  defense  is  not  entirely  satisfactory  because  the  success  probability 
of  a  memory  disclosure  attack  is  then  roughly  proportional  to  the  amount  of  the  disclosed 
memory. Their study naturally raised the following question: Is it possible, and if so, practical,
to safekeep cryptographic keys from memory disclosure attacks without relying on special
hardware devices? The question is relevant because legacy computers may not have or support
such devices, and is interesting on its own if we want to know what is feasible without
special  hardware  devices.  (We  note  that  the  basic  idea  presented  in  this  chapter  may  also  be 
applicable  to  protect  cryptographic  keys  appearing  in  the  RAM  of  special  hardware  devices 
when,  for  example,  the  devices’  operating  systems  have  software  vulnerabilities  that  can 
cause  the  disclosure  of  RAM  content.) 
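
The proportionality remark about the single-copy defense can be made concrete with a small back-of-the-envelope model. The sketch below is illustrative only and is not taken from [34]: it assumes that the attacker discloses a uniformly random fraction of RAM and that each copy of the key falls inside the disclosed region independently. The function name and the example figures are hypothetical.

```python
def disclosure_probability(disclosed_fraction: float, copies: int = 1) -> float:
    """Chance that at least one key copy lies in the disclosed memory.

    Assumes (for illustration only) that each copy lands in the disclosed
    region independently with probability `disclosed_fraction`.
    """
    if not 0.0 <= disclosed_fraction <= 1.0:
        raise ValueError("disclosed_fraction must be in [0, 1]")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return 1.0 - (1.0 - disclosed_fraction) ** copies


if __name__ == "__main__":
    # Compare one copy of the key against many copies flooding RAM.
    for copies in (1, 10, 100):
        p = disclosure_probability(0.01, copies)
        print(f"{copies:3d} copies, 1% of RAM disclosed: {p:.3f}")
```

Under these assumptions a single copy is exposed with probability equal to the disclosed fraction, which is the proportionality noted above, while many copies push the probability up quickly. This is why reducing the key to one copy helps but does not remove the risk.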

2.1.1  Our  Contributions 

In  this  chapter  we  affirmatively  answer  the  above  question  by  making  three  contributions. 
First, we propose a method for exploiting certain architectural features (i.e., certain CPU
registers)  to  safekeep  cryptographic  keys  from  memory  disclosure  attacks  (i.e.,  ensure  a  key 
never  appears  in  its  entirety  in  the  RAM).  Nevertheless,  cryptographic  functions  are  still 
efficiently  computed  by  ensuring  that  a  cryptographic  key  appears  in  its  entirety  in  the 
registers.  This  may  sound  counter-intuitive  at  first  glance,  but  is  actually  achievable  as  long 
as  the  registers  can  assemble  the  key  on-the-fly  as  needed. 

Second,  as  a  proof  of  concept,  we  present  a  concrete  realization  of  the  above  method 
based on OpenSSL, by exploiting the Streaming SIMD Extension (SSE) XMM registers
of modern Intel and AMD x86-compatible CPUs [22]. The registers were introduced for
multimedia application purposes in 1999, years before TPM-enabled computers were manufactured
(TCG itself was formed in 2003 [32]). Specifically, we conduct experimental studies
with  the  RSA  cryptosystem  in  the  contexts  of  SSL  3.0  and  TLS  1.0  and  1.1.  Experimental 
results  show  that  no  portion  of  a  key  appears  in  the  physical  RAM  (i.e.,  no  portion  of  a 
key  is  spilled  from  the  registers  to  the  RAM).  The  realization  is  not  straightforward,  and  we 
managed  to  overcome  two  subtle  problems: 

1. Dealing with interrupts: For a process that does not have exclusive access to a CPU
core (i.e., on a single-core CPU, or when sharing a single core of a multi-core CPU), we must
prevent other processes from reading the SSE XMM registers. This requires disabling
interrupts and avoiding entry into the kernel while the key is in the registers (the latter
is fortunately not difficult in our case). Because applications such as Apache generally do
not run with the root privilege that is required for disabling interrupts, we designed a
Loadable Kernel Module (LKM) to handle interrupt-disabling requests issued by such applications.

2. Scrambling and dispersing a cryptographic key in RAM while allowing efficient re-assembling
in registers: Some method is needed to load a cryptographic key into the
registers in a secure fashion; otherwise, a key may still appear in RAM. For this, we
implemented a heuristic method for “scrambling” a cryptographic key in RAM and
then “re-assembling” it in the relevant registers.

Third,  we  articulate  an  (informal)  adversarial  model  of  memory  disclosure  attacks  against 
cryptographic  keys  in  software  environments  that  may  be  vulnerable.  The  model  serves 
as  a  systematic  basis  for  (heuristically)  analyzing  the  security  of  software  against  memory 
disclosure  attacks,  and  may  be  of  independent  value. 

2.1.2  Discussion  on  the  Real-World  Significance 

As  will  be  shown  in  the  case  study  prototype  system,  the  method  proposed  in  this  chapter 
can  be  applied  to  legacy  computers  that  have  some  architectural  features  (e.g.,  x86  XMM 
registers  or  other  similar  ones).  Two  advantages  of  a  solution  based  on  the  method  are 
(1)  it  can  be  obtained  for  free,  and  (2)  it  could  be  made  transparent  to  the  end  users; 
both  of  these  ease  real-world  adoption.  However,  we  do  not  expect  that  the  solution  will 
be  utilized  in  servers  for  processing  high-throughput  transactions,  in  which  case  special 
high-speed  and  high-bandwidth  hardware  devices  may  be  used  instead  so  as  to  accelerate 
cryptographic  processing.  Nevertheless,  our  solution  is  capable  of  serving  50  new  HTTPS 
connections  per  second  in  our  experiments.  The  attacks  addressed  in  this  chapter  are  memory 
disclosure attacks, which are mainly launched via the exploitation of software vulnerabilities
in operating systems.


2.1.3  Chapter  Outline 

The rest of this chapter is organized as follows. Due to the complexity of the adversarial
model, we specify attacks along two dimensions. One dimension is independent
of  our  specific  solution  and  is  elaborated  in  Section  2.2.1  because  it  guides  the  design  of  our 
specific  solution.  The  other  dimension  is  dependent  upon  our  solution  (e.g.,  the  attacker  may 
attempt  to  identify  weaknesses  specific  to  our  solution)  and  presented  in  Section  2.4,  after 
we  present  our  specific  solution  in  Section  2.3.  Section  2.5  informally  analyzes  the  security 
of  the  resulting  system.  Section  2.6  reports  the  performance  of  our  prototype.  Section  2.7 
concludes  the  chapter  with  some  open  problems.  Note  that  related  work  is  discussed  in 
Chapter  4. 

2.2  Design  Rationale  for  Our  Solution 

2.2.1  General  Threat  Model 

Independent  of  our  specific  solution  design,  we  consider  an  attacker  who  can  disclose  some 
portion  of  RAM  through  some  means  that  may  also  give  the  attacker  some  extra  power  (as 
we  discuss  below).  To  make  this  concrete,  in  what  follows  we  present  a  classification  of  the 
most  relevant  memory  disclosure  attacks  (see  also  Figure  2.1). 

Pure  memory  disclosure  attacks.  Such  attackers  are  only  given  the  content  of  the 
disclosed  RAM.  Depending  on  the  amount  of  disclosed  memory,  these  attacks  are  divided 
into  two  cases:  partial  memory  disclosure  and  full  memory  disclosure.  Furthermore,  partial 
disclosure  attacks  can  be  divided  into  two  cases:  untargeted  partial  disclosures  and  targeted 
partial  disclosures.  An  untargeted  partial  attack  discloses  a  portion  of  memory  but  does  not 
allow the attacker to specify which portion of the memory (e.g., random portions of RAM
that may or may not have a key in it). In contrast, a targeted partial attack somehow
allows the attacker to obtain a specific portion of RAM. Although we do not know how to
accomplish this, this may be possible for some sophisticated attackers.

[Figure 2.1 tree: Memory Disclosure Attacks splits into Pure Memory Disclosure Attacks (Untargeted Partial, Targeted Partial) and Augmented Full Memory Disclosure Attacks (Reverse Engineer, Run Executable in Emulator or VM).]

Figure 2.1: Memory disclosure attack taxonomy.

Augmented  full  memory  disclosure  attacks.  Compared  with  the  full  memory  disclosure 
attacks  where  attackers  just  analyze  the  byte-by-byte  RAM  content,  augmented  full  memory 
disclosures  give  the  attacker  extra  power.  The  first  possible  augmentation  is  to  allow  the 
attacker  to  run  processes  on  the  machine  that  is  being  attacked.  This  requires  the  attacker 
to  have  access  to  a  user  account  on  the  machine,  but  neither  root  nor  the  account  that  owns 
the  key  being  protected  (e.g.,  apache);  otherwise,  we  cannot  hope  to  defeat  the  attacker. 
The main trick is that the attacker may seek to circumvent the ownership of the
registers that store the key (if applicable). The second possible augmentation is for the
attacker  to  use  the  victim  user’s  own  executable  image  (which  is  probably  in  the  disclosed 
RAM)  to  recover  the  key,  which  is  possible  because  the  executable  together  with  its  state 
must  be  able  to  recover  the  key.  We  further  classify  this  augmentation  into  two  cases: 
reverse-engineering,  where  the  attacker  reverse-engineers  the  executable  and  state  to  recover 
the  key;  and  running  the  executable  in  an  emulator  or  VMM  (Virtual  Machine  Monitor), 
where  the  attacker  can  actually  execute  the  entire  disclosed  memory  image  and  discover 
(for  example)  what  is  put  in  the  disclosed  RAM  or  registers,  if  the  attacker  can  somehow 
simulate the unknown non-RAM state such as CPU registers. Finally, an attacker could
employ multiple augmentations simultaneously, which we label as “combination” in our
classification.

2.2.2  Why  Use  Registers? 

There  are  four  reasons  for  choosing  to  place  the  key  in  registers  rather  than  in  some  other 
location: 

1.  Registers  have  very  fast  access.  In  fact,  registers  can  be  accessed  more  quickly  by  the 
CPU  than  any  other  part  of  the  system,  including  on-chip  cache  [36].  This  is  by  design. 

2. We can control access to the registers in a CPU by dedicating the CPU to the process
that owns the key. Almost all other resources are accessible by all CPUs in the system.

3.  Registers  are  available  on  all  x86  systems;  no  particular  hardware  is  required.  Note  the 
registers  we  use  require  support  for  SSE2,  which  has  been  offered  since  Intel’s  Pentium 
4  was  introduced  in  2000,  and  in  AMD  processors  since  2003. 

4.  Virtually  all  other  parts  of  the  system  are  accessible  using  a  RAM  address  access, 
either  by  Direct  Memory  Access  (DMA)  or  because  caches  are  accessed  using  memory 
addresses. 


2.3  The  Safekeeping  Method  and  Its  Implementation 

In  this  section  we  first  discuss  the  basic  idea  underlying  our  method,  and  then  elaborate 
the  relevant  countermeasures  that  we  employ  to  deal  with  threats  mentioned  above  (this 
explains  why  we  said  that  the  threat  model  guided  our  design). 
2.3.1 Basic Idea and Resulting Prototype

The  basic  idea  of  our  method  is  to  exploit  some  modern  CPU  architectural  features,  namely 
large  sets  of  CPU  registers  that  are  not  heavily  used  in  normal  computations.  Intuitively, 
such  registers  can  help  “avoid”  cryptographic  keys  appearing  in  RAM  during  their  use, 
because  we  can  make  a  cryptographic  key  appear  in  RAM  only  in  some  scrambled  form, 
while  appearing  in  these  registers  in  cleartext  and  in  its  entirety.  In  our  prototype,  we  use 
the  x86  XMM  register  set  of  the  SSE  multimedia  extensions,  which  were  originally  introduced 
by  Intel  for  floating-point  SIMD  use  and  later  also  adopted  by  AMD.  Each  XMM  register  is 
128  bits  in  size.  Eight  such  registers,  totaling  1024  bits,  are  available  in  32-bit  architectures; 
64-bit  architectures  have  16,  for  a  total  of  2048  bits.  These  registers  can  be  exploited  to  run 
cryptographic  algorithms  because  a  32-bit  x86  CPU  can  thus  store  a  1024-bit  RSA  private 
exponent,  and  a  64-bit  one  can  store  a  2048-bit  exponent.1 
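
The capacity arithmetic above can be checked with a few lines of code. This is a minimal sketch that only counts storage bits; the mode labels and function names are hypothetical, and it does not model any registers the arithmetic itself might need as working space.

```python
XMM_REGISTER_BITS = 128  # width of one SSE XMM register

# Number of XMM registers visible in each x86 mode, as stated in the text.
XMM_REGISTERS = {"x86-32": 8, "x86-64": 16}


def xmm_capacity_bits(mode: str) -> int:
    """Total bits available across all XMM registers in the given mode."""
    return XMM_REGISTERS[mode] * XMM_REGISTER_BITS


def exponent_fits(mode: str, exponent_bits: int) -> bool:
    """True if a private exponent of this size fits entirely in XMM registers."""
    return exponent_bits <= xmm_capacity_bits(mode)


if __name__ == "__main__":
    for mode in XMM_REGISTERS:
        print(mode, xmm_capacity_bits(mode), "bits")
```

For instance, `exponent_fits("x86-32", 2048)` is false while `exponent_fits("x86-64", 2048)` is true, which mirrors the 1024-bit and 2048-bit limits described above.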

Our prototype is based on OpenSSL 0.9.8e, the Ubuntu 6.06 Linux distribution with a
2.6.15 kernel, and SSE2, which was first offered in Intel’s Pentium 4 and in AMD’s Opteron
and Athlon-64 processors. Figure 2.2 depicts the resulting system architecture. It adds a
new supporting mechanism layer that loads a scrambled key into the relevant registers (i.e.,
assembling the scrambled key into the original key) and makes it available to cryptographic
routines.

Figure 2.2: The resulting system architecture

1Product roadmaps for Intel and AMD contain extensions enlarging these registers to 256 bits (as part
of Advanced Vector Extensions (AVX)), and we anticipate continued enlargement in the future.


2.3.2  Scrambling  and  Dispersing  a  Key  in  RAM 

A crucial issue in our solution is to store the key in RAM such that it will be difficult for
attackers to compromise. For this, one might suggest encrypting the key in RAM and then
decrypting it directly into registers.

However, this approach leaves two issues unresolved: (i) where the key for this “outer”
layer of encryption can be safely kept (i.e., we now have a chicken-and-egg problem, because
that key needs to be encrypted too), and (ii) how to ensure that there is no intermediate
version of the key in RAM. A similar argument would also apply to other techniques
aimed at a similar purpose. As such, we adopt the following heuristic method for scrambling
and dispersing a key in RAM:

• Initialization: This operation prepares a dispersed, scrambled version of the key in
question. It can be done in a secure environment, and the resulting bit strings are kept
on some secure storage device (e.g., hard disk or memory stick) so that they can later
be loaded into RAM as-is.

•  Recovery:  the  key  in  its  scrambled  form  is  first  loaded  into  RAM,  and  then  somehow 
“re-assembled”  at  the  relevant  registers  so  that  the  key  appears  in  its  entirety  in  the 
registers. 

As illustrated in Figure 2.3, the initialization method we implemented proceeds as follows.
(i) The original key is split into blocks of 32 bits. Note that the choice of 32-bit words is
not fundamental to the design; it could be a 16-bit word or even a single byte. (ii) Each
block is XOR’d with a 32-bit chaff that is independently chosen. As a line of defense, it
is ideal that the chaffs do not help the attacker to identify the whereabouts of the index
table. (iii) Each transformed block is split into two chunks of 16 bits. (iv) The chunks are
mixed with some “fillers” (i.e., useless place-holders to help hide the chunks) that exhibit
similar characteristics to the chunks (e.g., entropy-wise they are similar so that even the
entropy-based search method [61] cannot tell the fillers and the chunks apart). Clearly, the
recovery can obtain the original key according to the index table, each row of which consists
of a chaff and the address pointers to the corresponding chunks. Since security of the index
table is crucial, in the next section we discuss how to make it difficult to compromise.

Figure 2.3: Prototype’s method for scrambling and dispersing key
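
The initialization and recovery steps can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the prototype's code: the names `scramble`, `recover`, and `pool` are hypothetical, list indices stand in for the address pointers of the index table, big-endian block order is an arbitrary choice, and fillers are drawn as uniformly random 16-bit values. Python also cannot keep the recovered key out of RAM; in the actual scheme the re-assembly happens inside the XMM registers.

```python
import secrets


def scramble(key: bytes, pool_size: int = 4096):
    """Split, mask, and disperse `key`; return (pool, index_table)."""
    if len(key) % 4 != 0:
        raise ValueError("key length must be a multiple of 4 bytes")
    n_blocks = len(key) // 4
    if pool_size < 2 * n_blocks:
        raise ValueError("pool is too small to hold all chunks")

    # Step (iv): start with a pool of fillers that look like chunks.
    pool = [secrets.randbits(16) for _ in range(pool_size)]
    # Pick distinct random positions for the 2 * n_blocks real chunks.
    slots = secrets.SystemRandom().sample(range(pool_size), 2 * n_blocks)

    index_table = []
    for i in range(n_blocks):
        # Step (i): take one 32-bit block of the key.
        block = int.from_bytes(key[4 * i:4 * i + 4], "big")
        # Step (ii): XOR it with an independently chosen 32-bit chaff.
        chaff = secrets.randbits(32)
        masked = block ^ chaff
        # Step (iii): split the masked block into two 16-bit chunks.
        high, low = masked >> 16, masked & 0xFFFF
        high_slot, low_slot = slots[2 * i], slots[2 * i + 1]
        pool[high_slot], pool[low_slot] = high, low
        # Each index-table row holds the chaff and the two chunk positions.
        index_table.append((chaff, high_slot, low_slot))
    return pool, index_table


def recover(pool, index_table) -> bytes:
    """Follow the index table to re-assemble the original key."""
    key = bytearray()
    for chaff, high_slot, low_slot in index_table:
        masked = (pool[high_slot] << 16) | pool[low_slot]
        key += (masked ^ chaff).to_bytes(4, "big")
    return bytes(key)


if __name__ == "__main__":
    secret = secrets.token_bytes(128)  # stand-in for a 1024-bit exponent
    pool, table = scramble(secret)
    print("round trip ok:", recover(pool, table) == secret)
```

The sketch makes the dependency on the index table explicit: whoever holds the table can recover the key, which is why the table itself needs the additional protection discussed next.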

We note that some form of All-Or-Nothing Transformation [15] (as long as the inversion
process can be safely implemented in the very limited environment of registers) should
be employed prior to the scrambling in order to safeguard against attacks that work on
portions of RSA keys (e.g., [10] gives an attack that can recover an RSA private key in
polynomial time given the least-significant n/4 bits of the key). Using such a transformation
protects our scheme from these attacks and insulates the scheme and analysis from progress
in partial-exposure key breaking work. This also protects our scheme from attacks that
exploit structure in the RSA key, such as some attacks from Shamir and van Someren [61].
The exact technique and implementation should be chosen carefully so as to not spill any
intermediate results into RAM.

2.3.3  Obscuring  the  Index  Table 

To  defend  against  an  attacker  who  attempts  to  find  and  follow  the  sequence  of  pointers  to 
the  index  table,  we  can  adopt  the  following  two  defenses. 

First  defense.  We  can  use  a  randomly-chosen  offset  for  all  the  pointers  in  the  table,  as 
well  as  a  randomly-chosen  delta  number  to  modify  the  data  values  themselves.  The  offset 
and  delta  are  chosen  once  before  the  table  is  constructed,  and  then  the  pointer  values  in  the 
table  are  actually  the  memory  location  minus  the  offset.  The  actual  data  values  stored  at 
the  memory  locations  are  the  portions  of  the  key  minus  the  delta  value.  This  means  that 
even if the attacker finds the table, the pointers in it are not useful without successfully
guessing  the  offset  and  delta. 
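
A minimal sketch of the offset and delta idea follows. The function names, the 32-bit pointer width, and the wrap-around (modular) arithmetic are assumptions made for illustration; the text only states that the stored pointers are the memory locations minus the offset and that the stored values are the key portions minus the delta.

```python
ADDR_BITS = 32                    # assumed pointer width for this sketch
ADDR_MASK = (1 << ADDR_BITS) - 1
VALUE_MASK = 0xFFFF               # key pieces are 16-bit values


def hide(locations, pieces, offset, delta):
    """Return (table_pointers, stored_values) as they would sit in RAM."""
    table_pointers = [(loc - offset) & ADDR_MASK for loc in locations]
    stored_values = [(piece - delta) & VALUE_MASK for piece in pieces]
    return table_pointers, stored_values


def reveal(table_pointers, stored_values, offset, delta):
    """Undo hide(): add the offset and delta back, modulo the word size."""
    locations = [(ptr + offset) & ADDR_MASK for ptr in table_pointers]
    pieces = [(val + delta) & VALUE_MASK for val in stored_values]
    return locations, pieces


if __name__ == "__main__":
    locations = [0x08049F00, 0x0804A010]
    pieces = [0x1234, 0xBEEF]
    ptrs, vals = hide(locations, pieces, offset=0x0000C0DE, delta=0x00FF)
    print([hex(p) for p in ptrs], [hex(v) for v in vals])
    print(reveal(ptrs, vals, 0x0000C0DE, 0x00FF) == (locations, pieces))
```

An attacker who reads `table_pointers` and `stored_values` from a memory dump learns neither the true locations nor the true key pieces unless the offset and delta are also recovered, which motivates hiding those two numbers as described next.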

We  must  prevent  the  attacker  from  simply  scanning  all  of  the  statically-allocated  data 
for  potential  offset  and  delta  values  and  trying  all  of  them  whenever  interpreting  a  possible 
table  pointer.  We  can  defend  against  this  by  using  (for  example)  16  numbers  as  the  set 
of  potential  pointer  offsets,  and  8  numbers  as  the  set  of  potential  delta  values.  A  random 
number chosen at compile-time determines whether the actual pointer or value is or is not
XOR’d with each member of the corresponding set. (make can compile and run a short
program to generate this number and emit it as a #define appended to a header file. Such
values do not have storage allocated and only appear in the executable where they are used.)
Carefully constructing an expression controlled by this value, but where the appearance of the
value itself can be optimized away by the compiler, means that compiler optimization
will ensure that this constant does not appear directly in the final executable (and therefore
cannot be read from a RAM dump).2 We show an example expression below, using a
conceptual syntax for clarity.

Each number in the set is the same size as the pointer or short value. At compile time one
bit determines whether to XOR the two high halves, and the following bit whether to XOR
the two low halves. Note that breaking each number into two separately-operated pieces is
useful because it squares the factor by which we increase the attacker’s search space.
The use of each set forces the attacker to examine $4^{16}$ and $4^{8}$ possibilities for the pointers
and short values, respectively. Let us refer to the 64-bit set of numbers as $64B_0, \ldots, 64B_{15}$, and
designate the top and bottom halves of these as $64B_0^{H}, \ldots, 64B_{15}^{H}$ and $64B_0^{L}, \ldots, 64B_{15}^{L}$ respectively,
and use $p$ to denote the pointer being masked. Then,

$$p = p \oplus (64B_0^{H} \wedge \mathrm{bit}_0) \oplus (64B_0^{L} \wedge \mathrm{bit}_1) \oplus \cdots \oplus (64B_{15}^{H} \wedge \mathrm{bit}_{30}) \oplus (64B_{15}^{L} \wedge \mathrm{bit}_{31})$$

where $\wedge$ is an operator that returns 0 if either operand is zero, and returns the first operand
otherwise. The computation is similar for the 16-bit short values that contain scrambled
RSA key pieces.
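
The masking expression can be illustrated with a short sketch. The names `apply_mask` and `control_space` are hypothetical, and the control value is an ordinary run-time argument here; in the scheme described above it is a compile-time constant that the compiler folds away, which Python cannot demonstrate. Consecutive control bits select the high half and then the low half of each set member, as in the text.

```python
def apply_mask(value: int, mask_set: list, control: int, width: int) -> int:
    """XOR `value` with the halves of `mask_set` selected by `control`.

    Bit 2*i of `control` selects the high half of mask_set[i] and
    bit 2*i + 1 selects its low half.
    """
    half = width // 2
    low_half = (1 << half) - 1
    high_half = low_half << half
    for i, mask in enumerate(mask_set):
        if (control >> (2 * i)) & 1:
            value ^= mask & high_half
        if (control >> (2 * i + 1)) & 1:
            value ^= mask & low_half
    return value


def control_space(set_size: int) -> int:
    """Number of control settings to try: four choices per set member."""
    return 4 ** set_size


if __name__ == "__main__":
    masks = [0x11112222, 0x33334444]
    pointer = 0x0804A010
    masked = apply_mask(pointer, masks, control=0b0110, width=32)
    # XOR is its own inverse, so applying the same mask again unmasks.
    print(hex(masked), apply_mask(masked, masks, 0b0110, 32) == pointer)
```

Because each set member contributes two independent control bits, a set of 16 members gives `control_space(16)` settings, which equals 2 to the power 32, and a set of 8 members gives 2 to the power 16.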

Second defense. Let us suppose the attacker has some magical targeted partial disclosure
attack that identifies the index table, chunks, offset XOR values, and delta XOR values (note
that the actual possible attacks we know of are not nearly this powerful). The control values for
the offset XOR can be efficiently computed using the chunk addresses, and the control values
for the delta XOR may then be computed with a cost of $2^{16}$.

In order to rigorously defend against this, we can add a compile-time constant (see Section 2.3.3) that is used to specify a permutation on the index table. Lookups on the index table will now use this constant to control the order (e.g., the index used would be the index sought plus the last several bits ($\lg t$, where $t$ is the table size) of a pseudo-random number generator based on the pointer, modulo $t$. The pseudo-random number generator must have small state (current value kept in a register), be possible to compute entirely inside the x86 register space (limiting on 32-bit but roomy for 64-bit), and the trailing bits must not repeat within a period $t$). A 32-bit permutation constant (seed) would increase the attacker’s search space by a factor of $2^{32}$; a larger constant could be used if that simplified the implementation while providing at least $2^{32}$ permutations.

2 We verified that a sample expression compiled to a sequence of appropriate XORs, with the random constant not appearing, in gcc 3.4 and 4.0, with -O2.

Discussion.  Without  these  defenses,  an  attacker  could  just  build  the  executable  on  an 
identical system, run objdump and look for the appropriate variable name, and then examine
that  memory  location  in  the  process  to  find  the  index  table  (this  omits  some  details  such  as 
how  to  recover  the  process  page  table  which  gives  the  virtual  memory  mapping).  With  these 
defenses,  the  attacker  must  locate  and  interpret  particular  sequences  of  assembly  language 
instructions  in  the  particular  executable  being  used  on  this  machine  to  determine  how  to 
unscramble  and  order  pointers  and  values  in  each  of  various  stages  in  the  scrambling  process. 
The  possible  attack  routes  are  explained  in  Section  2.4  and  analyzed  in  Section  2.5. 

2.3.4  Disabling  Interrupts 

In order to ensure that register contents are never spilled to memory (for a context switch or system event), we need to disable interrupts. This can be achieved via, for example, a kernel module that provides a facility for non-root processes to disable and enable interrupts on a CPU core. However, there are three important issues:

1.  Since  illegitimate  processes  could  use  the  interrupt-disabling  functionality  to  degrade 
functionality  or  perform  a  denial-of-service  attack,  care  must  be  taken  as  to  which 
programs  are  allowed  to  use  this  facility.  A  mechanism  may  be  used  to  harden  the 
security  by  authenticating  the  application  binary  that  requests  disabling  interrupts, 
e.g.,  by  verifying  a  digital  signature  of  the  binary. 

2.  The interrupt-disabling facility itself may be attacked. For example, the kernel module we use to disable interrupts could be compromised or faked so that it silently fails to disable interrupts. Fortunately, we can detect this omission from userland using a non-privileged instruction and refuse to populate the XMM registers, reducing the attacker to a denial-of-service attack, which was already possible because the attacker had to have kernel access.

3.  A clever attacker might be able to prevent the kernel module from successfully disabling interrupts. For example, the attacker might perpetrate a denial-of-service attack on the device file used to send commands to the kernel module. Two points of our design make this particular attack difficult for the attacker:

(a)  First, the kernel module allows multiple processes to open the device file simultaneously, so that multiple server processes can access it, meaning an attacker cannot open the device to block other processes.

(b)  Second, the code that calls the kernel module automatically retries if interrupts have not become disabled. So in the worst case, the attack is downgraded to a denial-of-service attack, which is already easy when the attacker has this level of machine access.
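The fail-closed behavior described in issues 2 and 3 can be sketched as follows. This is an illustrative outline only: `request_disable`, `interrupts_enabled`, and `sign` are hypothetical callables standing in for the kernel-module request, the userland check, and the signing routine, and the retry bound is an assumption that the text does not specify.

```python
def acquire_interrupt_free_section(request_disable, interrupts_enabled, max_attempts=1000):
    """Retry the (hypothetical) disable request until interrupts are really off.

    Returns True once `interrupts_enabled()` reports False, and False if the
    facility never succeeds within `max_attempts` (an assumed bound).
    """
    for _ in range(max_attempts):
        request_disable()
        if not interrupts_enabled():   # independent check, not trusting the module
            return True
    return False


def sign_if_safe(request_disable, interrupts_enabled, sign):
    """Only run `sign` (which would load the key into registers) when it is safe."""
    if not acquire_interrupt_free_section(request_disable, interrupts_enabled):
        # The attacker gains at most a denial of service: the key is never loaded.
        raise RuntimeError("interrupts could not be disabled; refusing to load key")
    return sign()
```

The design point is that a compromised or blocked facility can prevent signing but cannot cause the key to be loaded while interrupts are still enabled.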

Discussion.  Disabling  interrupts  could  cause  side-effects,  most  notably  with  real-time  video, 
or  dropping  network  traffic  if  interrupts  were  disabled  for  a  long  time,  which  would  cause 
a  retransmission  and  hence  some  bandwidth  and  performance  cost.  Having  multiple  cores, 
as  most  64-bit  machines  and  almost  all  new  machines  do,  would  mitigate  these  problems.3 
Moreover, no ill effects were observed from disabling interrupts on our systems. Note that events that cannot be masked, such as page faults (which are exceptions rather than interrupts) and system management interrupts, cannot be disabled on x86. Thus the scheme is susceptible to low-level attacks that modify their
handlers.  Such  attacks  require  considerable  knowledge  and  skill,  require  privileges  on  well- 
managed  systems,  and  are  frequently  hardware-specific;  we  do  not  deal  with  such  attacks  in 
the  present  work. 

3In  fact,  according  to  /proc/interrupts,  the  Linux  2.6.15  kernel  we  used  directed  all  external  interrupts 
to  the  same  core,  so  simply  using  the  other  cores  for  our  technique  would  avoid  the  problem  entirely. 

2.4  Refining Attacks by Considering Our Design


Now  we  consider  what  key  compromise  methods  may  be  effective  against  our  design.  We 
emphasize  these  attacks  include  methods  specific  to  our  solution  and  thus  are  distinct  from 
the  general  threat  model,  whose  classes  of  attacks  are  independent  of  our  solution  and 
regulate  the  resources  available  to  the  attacker.  These  methods  specify  the  rows  of  our 
attack  analysis  chart  (Figure  2.4),  whereas  our  threat  model  specifies  the  columns.  The 
short  designation  used  in  the  figure  to  name  these  parts  is  highlighted  for  easy  reference 
when  examining  the  figure.  Often  multiple  approaches  can  be  used  to  achieve  the  same  goal, 
so  sometimes  the  attack  chart  lists  two  ways  to  accomplish  a  goal,  with  an  OR  after  the  first. 
When  multiple  steps  are  needed  to  accomplish  a  goal,  they  are  individually  numbered.  Here 
we  list  and  explain  the  methods  found  in  the  table: 

•  Retrieve  key  from  registers.  The  attacker  may  attempt  to  compromise  the  key  by 
reading  it  directly  from  the  XMM  registers. 

•  Retrieve  key  directly  from  RAM.  The  attacker  may  try  to  read  the  key  directly 
from  RAM,  if  present. 

•  Descramble  key  from  RAM.  These  are  the  most  interesting  and  subtle  attack 
scenarios.  Again,  since  multiple  approaches  may  be  used  to  achieve  the  same  attack 
effect,  sometimes  the  attack  chart  lists  two  ways  to  accomplish  a  given  objective,  with 
an  OR  after  the  first  (see  Figure  2.4).  Moreover,  when  multiple  steps  are  needed  to 
accomplish  an  objective,  they  are  individually  numbered.  The  descrambling  attacks 
may  succeed  via  two  means:  index  table  or  chunks. 

—  Via index table. This attack can be launched in three steps (see also Figure 2.4): “1. Locate index table”, “2. Interpret index table”, and “3. Follow pointers”. Specifically, the attacker must first locate the table by scanning RAM for it (e.g., using an entropy scan) or by following pointers to it. Assuming the attacker successfully locates the table, the attacker must then determine how to properly interpret it, since both the pointers and the chunk chaff values are scrambled (per Section 2.3.3). One way to interpret the table is to somehow compute the actual XOR used on the offsets and the actual XOR used on the values (“Determine actual XOR offset and XOR delta”). Another way is to “Use deltas and offsets and determine combination”; this means finding the deltas and offsets and then determining the proper combination of them (i.e., the value of the control variable embedded in the executable specifying whether to use each individual delta and offset). Finally, if the attacker has successfully located the table and determined how to interpret it, the pointers must be followed to actually find the chunks in proper order. In Section 2.3.3 we discussed how to defend against this by introducing a substantial number of permutations.

—  Via  chunks.  The  attacker  can  avoid  interpreting  the  table  and  attempt  to  work 
from  the  chunks  directly.  This  requires  three  steps  (see  also  Figure  2.4).  First, 
the  attacker  must  locate  the  chunks  themselves  in  the  memory  dump  (“1. Locate 
chunks”).  Then,  the  attacker  must  interpret  the  chunks  (“2. Interpret  chunks”) 
that  were  XOR’d  with  the  chaff  values.  Lastly,  the  attacker  must  determine  the 
proper  order  for  the  chunks  (“3. Order  chunks”),  which  is  demanding  since  the 
number  of  permutations  is  considerable. 


2.5  Security  Analysis 

It would be ideal if we could rigorously prove the security of the resulting system. Unfortunately, this is challenging because it is not clear how to formalize a proper theoretic model. The well-articulated models, such as the ones due to Barak et al. [6] and Goldreich-Ostrovsky [30], do not appear to be applicable to our system setting. Moreover, the aforementioned “supporting mechanism” itself may be reverse-engineered by the attacker, who may then recover the original key. We leave devising a formal model for rigorously reasoning about security in our setting as an open problem. In what follows we heuristically discuss the security of the resulting system.

Figure  2.4  summarizes  attacks  against  the  resulting  system,  where  each  row  corresponds 
to  a  key-compromise  attack  method  (see  Section  2.4)  whereas  the  columns  are  the  various 
threat  models.  At  the  intersection  of  a  column  and  row  is  an  attack  effect,  which  is  a  one 
or  two  letter  code  that  explains  the  degree  of  success  of  that  row’s  key  compromise  method 
given  that  column’s  threat  (see  codes  in  Section  2.5.2). 

2.5.1  Example  Scenario 

To  aid  understanding  of  the  chart,  we  consider  as  an  example  the  Full  Disclosure  threat 
model  where  the  attacker  is  given  the  full  RAM  content  and  attempts  to  compromise  the 
key  in  it.  In  this  case,  the  specific  attack  “retrieving  the  key  from  registers”  does  not  apply 
because  RAM  disclosure  attacks  do  not  contain  the  contents  of  registers.  Moreover,  the 
specific  attack  “retrieving  the  key  from  RAM”  fails  because  RAM  does  not  contain  the  key, 
as  detailed  in  effect  “B”  in  Section  2.5.2.  Thus,  the  attacker  may  then  try  to  retrieve  the 
key  via  the  index  table,  or  via  the  chunks  directly  as  elaborated  below. 

Via index table. Continuing down the column of the Full Disclosure threat model, the attacker scans the RAM dump for the index table, but this fails because the table has no readily-obvious identifying information (code “C” in Figure 2.4). Instead, the attacker can build the executable on another machine so as to find the storage location for the pointer to the index table, as shown in code “DS” in Figure 2.4. The attacker may try to guess the actual XOR value used for pointer offsets and the actual XOR value used for chunk deltas (“F1” in Figure 2.4), but the search space is $2^{26}$, which will still have to be multiplied by later cost factors since the guess can’t be verified until the actual key is assembled. Instead, the attacker can find the values that are combined to produce the deltas (difficult because they are dispersed throughout the process memory, “DD”), and then determine what combinations of these are used to form the actual offset XOR value and the actual delta XOR value (“F2”), at a cost of $2^{36}$ different guesses. In order to actually follow the decoded pointers and reassemble the keys, the permutation (one of $2^{32}$) induced by a compile-time random value (“G”) must be reversed, which requires considering $2^{32}$ permutations for each of those $2^{36}$ guesses. Thus $2^{32} \cdot 2^{36} = 2^{68}$ keys must be examined to attack via the index table if the deltas and offsets are found and then their combinations examined. Since directly determining the offsets and deltas costs $2^{26}$ (“F1”), examining $2^{32}$ permutations for each of those yields a cheaper total cost of $2^{58}$. As we will see, this is the most efficient attack, so “DS”, “F1”, and “G” are bolded because together they form the best attack for this column.

Figure 2.4: Effects of different attack methods in different threat models. Legend: A - Retrieving key from registers fails. B - Retrieving key from RAM fails because no copy is there. C - Table scan fails because no identifying information. DD - Doable with caveat (dispersed). DS - Doable with caveats (no symbols). E - Run executable in emulator or virtual machine. F1 - Search $2^{26}$ possibilities for actual XOR offset and actual XOR delta. F2 - Search $2^{36}$ to determine XOR offset control value and XOR delta control value. G - Circumventing table compile-time constant ordering defense requires $2^{32}$. H - Chunks encoded with 16 bits of chaff (per chunk). I - Chunks have $2^{296}$ possible orders. S - Attack stage would succeed given the caveat in parentheses. Bold items indicate best key compromise method in a given threat type. Notes in parentheses indicate caveats: “Manual” means requires substantial manual work for a highly-knowledgeable and skilled attacker, “if possible” means if there is a targeted partial disclosure attack that somehow finds only the items of interest.
Via chunks. In this case the chunks must first be located from dispersed memory, with no particular identifying characteristics (“DD”). The chunks must then be decoded, which is difficult, since each has been XOR’d with its own random 16-bit quantity (“H”) which is stored only in the index table (breaking this is prohibitively expensive because individual chunks can’t be verified; e.g., a 1024-bit key has 64 16-bit chunks, so $(2^{16})^{64} = 2^{1024}$). Lastly, the chunks must be ordered, but there are $2^{296}$ possible orders (“I”), so clearly the index table attack above that yields $2^{58}$ possible keys is faster.

Computational Cost of Best Attack. The fastest attack for the Full Disclosure threat model was the index table attack that yields $2^{58}$ possible keys. $2^{58} \approx 2.9 \times 10^{17}$, meaning an adversary with 8 cores that can each check 1000 RSA keys per second (i.e., 1000 sign operations per second per core) could break the defense to recover the key in slightly more than a million years (about ten million CPU years).
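The cost figures in this scenario can be reproduced with a short calculation. The sketch below simply multiplies the search-space factors named in the text and converts the result into time; the helper names and the 365.25-day year are our own choices for illustration.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def attack_cost(*factors_log2):
    """Multiply independent search-space factors given as base-2 exponents."""
    return 2 ** sum(factors_log2)


# Route 1: determine control values (F2, 2^36), then undo the permutation (G, 2^32).
via_control_values = attack_cost(36, 32)
# Route 2: determine the actual XOR values directly (F1, 2^26), then undo G (2^32).
via_actual_xor = attack_cost(26, 32)


def years_to_search(candidates, cores=8, keys_per_second_per_core=1000):
    """Wall-clock years to test every candidate key at the stated checking rate."""
    return candidates / (cores * keys_per_second_per_core) / SECONDS_PER_YEAR


best = min(via_control_values, via_actual_xor)
wall_clock_years = years_to_search(best)        # 8 cores in parallel
cpu_years = years_to_search(best, cores=1)      # total single-core effort
```

Under these assumptions the cheaper route is the $2^{58}$ one, with a wall-clock time a little over one million years and a total effort somewhat under ten million CPU years, in line with the rounded figures above.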

2.5.2  Effects  of  the  Key  Compromise  Methods 

Here we elaborate the effects of the key compromise methods in the threat models. For example, effect A is what occurs when an attacker launches the attack “retrieve the key from registers” in the threat model of “run processes on machine”.

Effect A: Retrieving key from registers fails. The most obvious key compromise method is to steal the key when it is loaded into the SSE registers. As discussed before, special care was taken to prevent this attack by appropriately disabling interrupts, so that our process has full control of the CPU until we relinquish it.

Effect  B:  Retrieving  key  from  RAM  fails  because  no  copy  is  there.  The  second 
most  obvious  way  to  recover  the  key  is  if  it  was  somehow  “spilled”  from  the  registers  to 
RAM  during  execution.  We  conducted  experiments  to  confirm  that  this  does  not  happen. 
Specifically,  we  analyzed  RAM  contents  while  Apache  is  running  under  VMware  Server 
on  an  Intel  Pentium  930D.  The  virtual  machine  was  configured  as  a  512MB  single  CPU 
machine  with  an  updated  version  of  Ubuntu  6.06,  with  VMware  tools  installed.  A  Python 
script  generated  10  HTTP  SSL  connections  (each  a  10k  document  fetch)  per  second  for  100 
seconds.  Then  our  script  immediately  paused  the  virtual  machine,  causing  it  to  update  the 
.VMEM  file  which  contains  the  VM’s  RAM.  We  then  examined  this  RAM  dump  file  for 
instances  of  words  of  the  key  in  more  than  a  dozen  runs.  In  no  cases  were  any  words  of  the 
key  found. 
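The dump examination can be pictured with the following sketch, which searches a memory image for any aligned word of the key. It is not the script used in our experiments; the function name, the 4-byte word size, and the toy key and dump are assumptions for illustration.

```python
def find_key_words(dump: bytes, key: bytes, word_size: int = 4):
    """Return (key_word_index, dump_offset) for every key word found in the dump."""
    hits = []
    for i in range(0, len(key) - word_size + 1, word_size):
        word = key[i:i + word_size]
        start = dump.find(word)
        while start != -1:                      # record every occurrence of this word
            hits.append((i // word_size, start))
            start = dump.find(word, start + 1)
    return hits


# Toy data: a 1024-bit "key" with no zero bytes and two 4 KiB memory images.
key = bytes(range(1, 129))
clean_dump = bytes(4096)                        # no key material present
leaky_dump = bytearray(4096)
leaky_dump[100:104] = key[4:8]                  # one key word spilled into memory
leaky_dump = bytes(leaky_dump)
```

On real dumps a key word with little entropy (for example, a run of zero bytes) would produce false positives, so hits would still need manual inspection.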

Effect  C:  Table  scan  fails  because  no  identifying  information.  The  attacker  can  seek 
to  find  the  index  table  by  scanning  for  plausible  contents.  Identifying  the  index  table  by  its 
contents  is  difficult  because:  (i)  the  chaff  is  low  entropy,  so  it  can’t  be  easily  used  to  find  the 
table;  (ii)  the  pointers  in  the  table  point  to  dynamically-allocated,  rather  than  consecutive, 
memory  addresses,  so  they  can’t  be  directly  used  either.  Examining  the  contents  of  the 
regions  pointed  to  by  the  potential  index  pointers  seems  to  be  the  attacker’s  best  approach. 
Some  candidates  can  now  be  ruled  out  quickly  because  they  point  to  invalid  locations  or 
locations  that  contain  entirely  zeroes.  However,  it  remains  quite  difficult  for  the  attacker 
to  decide  if  a  sequence  of  pointers  actually  does  point  to  the  chunk  and  filler,  because  it  is 
difficult  to  differentiate  a  pointer  to  a  location  that  contains  16  bits  of  scrambled  key  and  16 
bits  of  filler  from  a  pointer  to  any  other  location  in  memory. 

Effects DD, DS: Doable with caveats. These symbols are used to mark combinations which can be accomplished but require a cost that is not expressible in computational terms. We emphasize that the security of our scheme is never reliant on these factors; they are merely additional hurdles for the attacker to surpass. DD indicates that finding objects is theoretically possible given that they are located in RAM (and more precisely in the address space for the process that uses the key), but difficult given that they are dispersed non-deterministically by malloc(), an effect that may be enhanced by also allocating fake items of the same size. This is particularly difficult when the items have no particular identifying characteristics that readily distinguish them from other values in memory. True, in some instances, such as the chunks, they will be of higher entropy than the surrounding data, but we expect that it would be hard to pick out a single 16-bit chunk as higher entropy than its surroundings, and extremely difficult for tiny 1-bit chunks. Still, because we cannot quantify the difficulty of doing this, we must assume that it is possible. DS indicates that values are statically allocated by the compiler but rather difficult to find because we do not include any symbols, meaning they are simply particular bytes in the BSS (Block Started by Symbol) segment identified only by their usage in the executable. The attacker’s best attack is to rebuild the executable to find the locations.

Effect E: Run executable in emulator or virtual machine. Executable images can be exploited by executing them. We believe executing disclosed memory images enables a powerful class of attacks, which to the best of our knowledge have not been previously studied. Namely, an attacker can acquire a full memory image and then execute it inside an emulator or virtual machine, where its behavior can be examined in detail, without hardware probes or other hard-to-obtain tools. Certain hardware state, primarily CPU registers, will not be contained in the memory image and must be obtained or approximated. Since operating systems save the state of the CPU when taking a process off of it, the attacker could simply restore this state and be able to execute for at least a short duration, likely at least until the first interrupt or system call. If a memory image was somehow obtained just before our prototype started loading the XMM registers with the RSA key, this basic state technique would probably suffice for the attacker to observe what values are loaded into the registers on the emulator (or virtual machine). We suspect that any obfuscation mechanism that employs software will be amenable to some form of this attack. Fortunately, we expect this attack will require significant manual work from a highly-skilled attacker. There might be some approach that could allow the attacker to reduce the amount of manual work by making a significant up-front investment.

Effect F1: Search $2^{26}$ possibilities for actual XOR offset and actual XOR delta. In order to interpret the index table, the attacker must circumvent the offsets and deltas, as explained in Section 2.3.3. Since these have a range of $2^{64}$ and $2^{32}$, a brute force search requires $2^{96}$. By checking each value found in memory, rather than each possible delta and offset, the search space can be reduced substantially. In this case the attacker must search each possible value from memory ($M$) and then compute the delta and offset that would match it on each index. That then gives a delta and offset which can be used to interpret the remainder of the table. Let $M$ = 1 megabyte = $2^{20}$. Assuming a 1024-bit key broken into 16-bit chunks, table size $t = \frac{1024}{16} = 64 = 2^{6}$. So that gives a total cost of $M \cdot t = 2^{26}$ for breaking the XOR offsets and deltas.
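The reduction from $2^{96}$ to $M \cdot t$ can be seen in a toy version of the search. The sketch below covers only the pointer offset (not the delta), uses tiny values of $M$ and $t$ so that it runs instantly, and all names and sample values are hypothetical.

```python
def candidate_masks(table_entries, memory_addresses):
    """Yield every XOR mask implied by pairing a scrambled entry with a plausible address."""
    for entry in table_entries:          # t scrambled pointers from the index table
        for addr in memory_addresses:    # M plausible unscrambled pointer values
            yield entry ^ addr


# Toy sizes; the text assumes M = 2^20 and t = 2^6.
true_mask = 0x5A5A_1234_9ABC_DEF0                     # secret XOR offset (hypothetical)
addresses = [0x1000 + 16 * i for i in range(256)]     # M = 256 plausible targets
chunk_ptrs = addresses[10:14]                         # t = 4 real chunk pointers
table = [p ^ true_mask for p in chunk_ptrs]           # what the attacker sees in RAM
masks = list(candidate_masks(table, addresses))
```

The true mask is guaranteed to be among the $M \cdot t$ candidates, but the attacker cannot tell which one it is without carrying each guess through to a complete key, which is why this factor multiplies the later costs.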

Effect F2: Search $2^{36}$ to determine XOR offset control value and XOR delta control value. In order to interpret the index table, the attacker must circumvent the offsets and deltas, as explained in Section 2.3.3. Assuming the attacker has somehow found the offsets and deltas in RAM, let us examine the possibility of determining the control value that specifies which offsets to use to compute the XOR offset and the control value that specifies which delta values to use to compute the XOR delta. Since the control values have a range of $2^{32}$ and $2^{16}$ (and the offsets and deltas themselves have a larger range), a brute force search would require $2^{48}$. Limiting the XOR offset to a plausible set of values yields a search space of $2^{20}$ for the offset (i.e., only check XOR control values that result in pointer values that address within the data segment, which we’ll assume is 1 MB). Since the attacker needs to find the offset XOR for the pointers and the delta XOR for the chaffs, the search space is $2^{20} \cdot 2^{16} = 2^{36}$. Note that since these values cannot be verified to be correct until an RSA sign operation verifies the actual resulting key, this $2^{36}$ is a multiplicative factor in the computational cost of finding a key with any process that includes this step.

Effect G: Circumventing table compile-time constant ordering defense requires $2^{32}$. Section 2.3.3 describes how the pointers in the index table can be permuted using a compile-time constant providing $2^{32}$ permutations. In order to discover the key, the attacker must try all $2^{32}$ permutations to see if each one gives a key that produces a correct result when used.

Effect H: Chunks encoded with 16 bits of chaff (per chunk). Each chunk is XOR’d with its own chaff (16 bits of random data). If the attacker can’t decode and validate a chunk at a time, brute-forcing these is clearly computationally infeasible: e.g., $2^{16 \cdot 64} = 2^{1024}$ for a 1024-bit key in 16-bit chunks. If the attacker were somehow able to validate an individual chunk, then the cost is only $2^{16} \cdot \frac{1024}{16}$, which is negligible. However, since a chunk is merely 16 bits (or even 1 bit if $b = 1$ and $s = 1$) of high-entropy data with no particular structure, we cannot conceive of any way an attacker could validate an individual chunk.

Effect I: Chunks have $2^{296}$ possible orders. Even if the chunks were correctly decoded, they still must be assembled in the correct order to form the key. However, even for a 1024-bit key broken only into 16-bit pieces, there are $10^{89}$ permutations of the pieces, which is approximately $2^{296}$.
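The permutation count follows from the number of chunks: a 1024-bit key in 16-bit pieces has 64 chunks, and 64 distinct items can be ordered in 64! ways. A brief check, with variable names chosen only for illustration:

```python
import math

chunks = 1024 // 16                      # number of 16-bit pieces in a 1024-bit key
orders = math.factorial(chunks)          # possible orderings of the pieces
log2_orders = math.log2(orders)          # size of the ordering space in bits
log10_orders = math.log10(orders)        # same quantity as a power of ten
```

Both logarithms land on the rounded values quoted above (roughly 296 bits, or a little over $10^{89}$).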

2.5.3  Security  Summary 

The best computational attacks (“Full Disclosure” and “Partial Disclosure Untargeted” columns) require checking $2^{58}$ RSA keys, which costs about 10 million CPU years. If a special targeted partial disclosure attack can somehow be conceived, there is a $2^{32}$ attack, which takes some computation but is quite feasible. A skilled and knowledgeable attacker who has a great deal of time and patience can break the scheme with a couple of different highly-manual attacks: either reverse-engineering the particular executable on the attacked system and applying the results to the disclosed image, or setting up a carefully-timed disclosed image to be executed on an emulator or virtual machine and reading the key from the registers when they are populated.

This  is  a  great  contrast  to  a  typical  system,  which  is  fundamentally  vulnerable  to  Shamir 
and  van  Someren’s  attacks  [61]  which  scan  for  high  entropy  regions  of  memory  (note  keys 
always  must  be  high  entropy  so  they  cannot  be  easily  guessed)  and  might  require  checking 
around  a  few  dozen  candidate  keys.  Recall  [34]  showed  that  unaltered  keys  are  visible  in 
RAM  in  the  common  real  systems  Apache  and  OpenSSH.  The  successful  attacks  shown 
in  [33]  suggest  that  typical  systems  are  likely  also  vulnerable  to  data-structure-signature 
scan  methods  to  find  Apache  SSL  keys  and  scans  for  internal  consistency  of  prospective  key 
schedules  to  find  key  schedules  for  common  disk  encryption  systems. 
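To make the contrast concrete, the following sketch shows a generic sliding-window entropy scan of the kind that finds an unprotected key stored contiguously in memory. It is a simplified illustration, not the algorithm of [61]; the window size, step, threshold, and toy memory image are assumptions.

```python
import math
from collections import Counter


def shannon_entropy(window: bytes) -> float:
    """Empirical byte entropy of a window, in bits per byte."""
    n = len(window)
    return -sum(c / n * math.log2(c / n) for c in Counter(window).values())


def high_entropy_offsets(data: bytes, window=64, step=16, threshold=5.5):
    """Offsets of all windows whose byte entropy reaches the threshold."""
    return [off for off in range(0, len(data) - window + 1, step)
            if shannon_entropy(data[off:off + window]) >= threshold]


# Toy image: a 128-byte stand-in key (all bytes distinct) between zeroed regions.
key = bytes((i * 37 + 11) % 256 for i in range(128))
image = bytes(1024) + key + bytes(1024)
suspects = high_entropy_offsets(image)
```

Here every reported window lies inside the contiguous key, so only a handful of candidates need checking. In our scheme the key never appears contiguously in RAM, so a scan of this kind has nothing comparable to find.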

From this analysis we see that our defenses would be especially effective against automated malware attacks, which we expect to be the most probable threat against low-value and medium-value keys. High-value keys may be worthwhile for an attacker to specifically target with manual effort, but we expect systems using those will likely use hardware solutions such as SSL accelerator cards and cryptographic coprocessors. Such hardware is too expensive for most applications, but provides high performance as well as hardware key protection for high-end applications.

2.6  Performance  Analysis  of  Prototype 

Microbenchmark performance. First we examine the performance of RSA signature operations in isolation. Using our modified version of OpenSSL on a Core2Duo E6400 dual core desktop, a 1024-bit RSA sign operation requires 8.8 ms with our prototype versus 2.0 ms for unmodified OpenSSL. This is expected because we can’t use the Chinese Remainder Theorem (we can’t fit p and q into the registers in addition to d due to their space limitation). Nevertheless, our prototype just used the most basic (and therefore slowest) square-and-multiply technique for modular exponentiation offered by OpenSSL, which could be improved by using Montgomery multiplication.
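The difference between the two signing strategies can be sketched in a few lines. This is not OpenSSL's implementation: it is a textbook illustration with toy parameters, in which `square_and_multiply` needs only d and n, while `sign_crt` additionally needs p and q (the values that do not fit in the registers alongside d).

```python
def square_and_multiply(base, exponent, modulus):
    """Left-to-right binary modular exponentiation using only (base, exponent, modulus)."""
    result = 1
    base %= modulus
    for bit in bin(exponent)[2:]:            # exponent bits, most significant first
        result = (result * result) % modulus
        if bit == "1":
            result = (result * base) % modulus
    return result


def sign_crt(m, p, q, d):
    """RSA private-key operation via the Chinese Remainder Theorem (needs p and q)."""
    dp, dq = d % (p - 1), d % (q - 1)        # reduced exponents
    q_inv = pow(q, -1, p)                    # modular inverse (Python 3.8+)
    s_p = pow(m, dp, p)                      # two half-size exponentiations
    s_q = pow(m, dq, q)
    h = (q_inv * (s_p - s_q)) % p            # Garner recombination
    return s_q + h * q


# Toy textbook parameters (far too small for real use).
p, q, d = 61, 53, 2753
n = p * q
message = 65
```

Both functions produce the same signature for the same inputs. CRT replaces one full-size exponentiation with two half-size ones, which accounts for much of the gap between the prototype and unmodified OpenSSL.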

Apache Web Server SSL Performance. Now we examine the performance of our prototype within Apache 2.2.4, using a simple HTTPS benchmark. An E6400 acts as the client and another E6400 dual core desktop on the same 100 Mbps LAN acts as the server. For the first test we initiate 10 SSL connections every 0.2 seconds, fetching a ten kilobyte file and then shutting down. The 0.2 second interval was chosen because it represented a reasonable load of 50 new connections per second. We note our solution is not expected to be used for high-throughput servers, which would often use special hardware for accelerating cryptographic processing. The result is that average query latency over 100,000 requests increases from about 80 milliseconds for unmodified Apache to about 120 milliseconds for the prototype (recall all 10 queries are initiated simultaneously, which slows average response time). Average CPU utilization also increased from 45% to 61%. From this we conclude there is no substantial impact on observed performance under reasonable load, and that the throughput we measured should be sustainable over long periods of time.

In many ways this experimental setup represents a worst case. SSL negotiation including RSA signing is done for every transfer, with no user think time to overlap with, whereas we expect real-world SSL connections to transfer multiple files consecutively and have long pauses of user think time where other requests can be overlapped. Moreover, we access a single local file that will doubtless be quickly retrieved from cache, whereas we expect that real-world HTTPS interactions will frequently require a disk and/or database hit.

We  also  demonstrate  the  scalability  of  our  prototype  systems.  Figures  2.5(a)  and  2.5(b) 
show  Apache  server  CPU  utilization  and  response  time  for  the  1024-bit  SSL  benchmark  as  a 
function  of  interval  in  seconds  between  sets  of  10  requests,  with  5000  requests  per  data  point, 
demonstrating  that  our  prototype  scales  about  as  well  as  Apache.  In  these  experiments,  the 
behavior  of  Apache  becomes  distorted  when  CPU  utilization  exceeds  approximately  70%; 
the reason for this is unknown but may be because of scheduling. This can be seen in the
dips and valleys on the left of Figure 2.5(a), and likely causes the similarly-timed aberrations
on the left of Figure 2.5(b). Because each data point is from only 5000 requests, on a testbed
which is not isolated from the department network, there is some noise which causes minor
fluctuations in the curve, visible on the right of Figure 2.5(b).

Figure 2.5: Apache SSL benchmark CPU utilization and response time, as a function of interval
in seconds between sets of 10 requests. Panel (a): Apache server CPU utilization (10 parallel
queries, dual core).


2.7  Summary 

In  this  chapter  we  presented  a  method,  as  well  as  a  prototype  realization  of  it,  for  safekeeping 
cryptographic keys from memory disclosure attacks. The basic idea is to eliminate the
appearance of a cryptographic key in its entirety in RAM, while allowing efficient cryptographic
computations  by  ensuring  that  a  key  only  appears  in  its  entirety  in  certain  registers. 
Chapter 3

ASSURED DIGITAL SIGNING WITH THE PROTECTED MONITOR

3.1  Introduction 

In Chapter 2, we presented a mechanism to defeat a subclass of attacks against cryptographic
keys, namely attacks that exploit software vulnerabilities to steal cryptographic keys from
memory. In this chapter we move a step further to investigate a more powerful subclass of
attacks, in which the attacker is not primarily attempting to compromise the
cryptographic key itself (which could be protected with special hardware or the solution
presented in Chapter 2), but rather to compromise the corresponding cryptographic functions.
As a side effect, our new mechanism also secures the key itself from malware attacks.

Digital  signatures  are  a  widely  used  cryptographic  tool  for  assuring  various  authenticity 
needs,  including  non-repudiation,  sources  of  data  access  control  requests  (while  possibly 
protecting  privacy  if  desired  [IT]),  and  sources  of  data  items  or  software  programs  in  the 
form  of  provenance  for  evaluating  their  trustworthiness  [35,  80,  73].  However,  there  is  a  gap 
between  the  authenticity  offered  by  digital  signatures  in  the  abstracted  models  (see  [31]  for 
the  classic  and  standard  definition)  and  the  authenticity  required  by  real-world  applications. 
This is because the abstracted models (inevitably) have to assume away some attacks that
are nonetheless relevant in a broader security context.

A particular type of such attack was termed hit-and-stick, but left as an open problem, in
the literature [75]. In this attack, the attacker (via stealthy malware) penetrates
a computer system, possibly evading current security mechanisms. As a consequence,
the attacker can compromise the private signing functions by simply feeding any
desired message to the program or device that holds private signing keys or computes
cryptographic functions. The concept was highlighted many years ago by Loscocco et al. [46],
but the problem has since remained open: the only defense is offered by various types of
intrusion detection systems that hopefully can detect the attack, so that the cryptographic
key can be revoked. Unfortunately, as discussed in [75], this is far from sufficient.

3.1.1  Our  Contributions 

In this chapter we address the hit-and-stick attack against digital signing in real-life
systems. Specifically, we present the design of a general, extensible framework for enhancing
the authenticity offered by digital signatures. The framework offers digital signatures with
systems-based assurances¹ that can be verified by the signature verifiers, which is very useful
in applications such as analyzing the trustworthiness of data via their digitally signed
provenance. The framework utilizes both trusted computing and virtualization simultaneously. It
is extensible because it can integrate other virtualization-based security mechanisms so as to
provide a more comprehensive security solution (rather than only protecting cryptosystems).

Further, we present a concrete implementation and evaluation of a light-weight system
as a prototype instantiation of the general framework (Section 3.4). The core of our solution
is a novel software module called the protected monitor, a light-weight software
substrate that sits beneath the guest OS kernel but on top of the hypervisor, and that might be
of independent value. In other words, it is less powerful than the hypervisor but more
privileged than the guest OS kernel.

In addition to providing experimental performance evaluation (Section 3.7), we conduct
a systematic security analysis of a number of possible threats against the system, which
shows that the resulting system has no security flaws as long as the underlying hypervisor
is secure (Section 3.6). Finally, Section 3.8 presents additional detail on the contribution of
the protected monitor.

¹ This can be thought of as systems security repaying cryptography for its assistance.

Our  complete  solution  has  several  desirable  features.  First,  private  signing  keys  are  not 
directly  accessible  from  the  user’s  (compromised)  VM,  even  via  raw  disk  access,  meaning  that 
malware  can  no  longer  easily  disclose  the  keys.  Second,  our  solution  does  not  require  any 
modifications  to  the  source  code  of  applications  that  use  properly-designed  cryptographic 
libraries,  which  greatly  increases  its  applicability.  Third,  applications  requesting  use  of  the 
key  can  be  attested  before  they  are  allowed  to  use  the  key. 

3.2  Design  Rationale  for  Our  Solution 

The  objective  of  assured  digital  signing  is  to  add  systems-based  security  assurance  to  the 
cryptographic properties of digital signatures so that we can get the best of both worlds:
systems security and cryptography. As a result, the signature verifier can expect greater
trustworthiness  in  the  data  the  signature  vouches  for,  which  is  important  in  the  verifier’s 
decision-making  process. 

Our  proposed  framework  is  to  accompany  a  digital  signature  with  assertions  on  the 
system  state  under  which  the  signature  was  generated.  The  framework  is  general  in  the 
sense  that  it  can  accommodate  many  other  specific  techniques  for  monitoring  the  state  of  the 
system  and  can  be  integrated  into  a  large  class  of  security  mechanisms  for  a  comprehensive 
solution.  Moreover,  it  can  accommodate  architectures  that  already  offer  a  hardware  device 
for conducting cryptographic computation. Communication layers are provided to make
inter-VM communication transparent to the application, which is still written as if it is
invoking a cryptographic service provider as a local library.² In Section 3.3 we will explore
the solution design space while bearing in mind the above requirements, after we present our
threat model and compare related work.

² Note that using a cryptographic library with a well-designed API is a best practice for security of
cryptographic applications, as compared to other alternatives such as writing cryptographic code in-house.
For example, this allows an application to switch to a different cryptographic library if implementation
weaknesses are discovered.

3.2.1  Threat  Model 

We  use  the  same  standard  assumptions  typical  in  virtualization  security  architectures  ([56, 
28,  27,  40]): 

•  The  hypervisor  and  trusted  VM  are  in  the  trusted  computing  base  (TCB)  and  are  thus 
secure. 

•  Malicious  code  can  only  affect  the  guest  VM,  but  this  includes  the  guest  OS  kernel. 

•  The  hypervisor  provides  isolation  between  the  trusted  VM  and  the  untrusted  VM. 

Additionally,  we  assume  that  the  user  does  not  install  and  run  applications  in  the  trusted 
VM, but performs all user activity in the untrusted VM. One way to achieve this for most
users is simply to make the trusted VM hard to access.

This  threat  model  is  realistic,  as  it  assumes  the  attacker  can  do  anything  he  desires  to 
the  guest  VM,  including  inserting  both  user-space  and  kernel-space  malicious  code. 

3.2.2  Why  Hardware/TXT  Alone  Is  Not  Sufficient 

In order to defeat the threat of software-based attacks against private signing keys, we can
certainly store them in hardware devices such as a co-processor [78] or Trusted Platform
Module (TPM) [32]. However, it is much more difficult to defeat software-based attacks that
target the private signing functions rather than the private signing keys. This is because
once the attacker penetrates and compromises the operating system (OS) to which the
hardware devices are attached, the attacker can simply request the hardware devices to sign
any message desired. The same attack disqualifies both Intel’s Trusted Execution Technology
(TXT) [38] and AMD’s Secure Virtual Machine (SVM) [2], which provide
a hardware-protected clean execution environment on demand (i.e., without rebooting the
system). This is because such an environment is invoked through a privileged
instruction, which can nevertheless be issued by malware that has already compromised the
OS kernel. As a consequence, this also disqualifies follow-on solutions that exploit
the TXT technique (e.g., [48]).

One obvious countermeasure to this attack would be to deploy a CAPTCHA (Completely
Automated Public Turing test to tell Computers and Humans Apart) system [14], namely
by challenging the digital signing requester to solve a problem that can only be solved
by a human. However, this solution is either annoying, because the requester has to solve a
CAPTCHA challenge for every digital signing, or not possible, because the signing process
is invoked automatically by other application programs (i.e., without a human in
the loop). More importantly, this solution is actually not secure because the attacker, who
has compromised the OS and controls the communication channel between the user and
the hardware device, can launch the following man-in-the-middle attack: the attacker
simply uses the user to solve the CAPTCHA challenge on its behalf and then tells the
user that the solution just entered was incorrect. The user
may not suspect a man-in-the-middle attack because humans often make
mistakes in discerning or typing solutions to CAPTCHA challenges, especially
as the challenges become more and more sophisticated so as to defeat automatic CAPTCHA
solvers [14]. The use of TXT-based trusted I/O may be able to defeat this man-in-the-middle
attack because the malware cannot intercept the user’s input. However, this approach has
the drawback that everything else running on the system has to be frozen while the
protected environment runs. Moreover, as mentioned above, the fact that no human is involved
in many applications disqualifies this solution.

3.2.3  Comparison  to  Other  Possible  VMM-based  Approaches 

Stock hypervisors. Ordinary stock hypervisors can be used to isolate an untrusted domain
from a trusted domain. Applications in the untrusted domain could then request secure
services from applications in the trusted domain via some typical interdomain communication
mechanism. There are two disadvantages to this approach:

•  There is no mechanism for the trusted domain to vet the application in the untrusted
domain requesting the service. For example, the application in the untrusted domain
could be the victim of a code-injection attack launched by malware. (Systems
augmented with VM introspection are discussed below.)

•  Ordinary interdomain communication mechanisms rely on the kernel to set up or
perform the communication. In either case, a malicious kernel can easily disrupt the
service.

In  contrast,  our  protected  monitor  approach  has  the  following  advantages: 

•  Our  design  allows  the  trusted  domain  to  verify  the  code  of  the  application  sending  a 
message,  as  well  as  to  protect  the  code  of  that  application  against  modification. 

•  Our design provides for fast, high-throughput communication from the untrusted
domain to the trusted domain and vice versa.

•  Our  design  removes  the  kernel  from  the  interdomain  communication  process,  reducing 
the  communication  threat  of  a  malicious  kernel  to  a  denial-of-service  attack. 

VM introspection. Our in-VM protected monitor is much more powerful than VM
introspection for two reasons:

•  First, it greatly reduces the semantic gap problem, created by attempting to understand
the semantics of operations of a virtual machine from outside the VM. The protected
monitor allows us to securely run operations within the VM, where the semantic gap
does not apply. For example, instead of examining what an external observer expects
to be the process list and inferring what particular values mean, the protected monitor
could call into the kernel to get a list of processes directly. However, this does not fully
and securely resolve the semantic gap problem, because invoking functionality that is
only implemented in the guest OS kernel or applications is still subject to attacks on
those components, even though attacks on the monitor itself are not possible. Instead,
it gives the system a selectable tradeoff between completeness of information and
security of the information acquisition. In the process list example above, we
may be willing to accept some risk of getting compromised information from a
compromised kernel in exchange for full information on processes regardless of kernel version.
In this work we choose to maximize security of information acquisition. Note that
because checking the accuracy of information is often easier than acquiring the
information directly, we believe the protected monitor may be able to acquire complete
information without necessarily reducing the security of information acquisition. We
leave study of the full potential of the completeness-security tradeoff and mitigation
techniques to future work.

•  Second, the protected monitor provides for secure and efficient two-way
communication between a trusted VM and userland applications in an untrusted VM, without
relying on the kernel during communication. In fact, not only is the kernel unable to
intercept or block the communication process, but the application itself is protected during
communication by features such as write-protecting its executable and shared libraries
against modification by the kernel or other applications, even privileged applications.
This prevents code injection from occurring during communication, and allows the
system to measure an executable only when the communication process begins rather
than on every communication, which substantially reduces the overhead of measuring
the executable.
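
The measure-once idea in the second point can be illustrated with a toy model. This is a minimal sketch under stated assumptions, not the prototype's code: the class name, the use of SHA-256, and the immutable byte string standing in for write-protected pages are all ours.

```python
import hashlib


class ProtectedChannel:
    """Toy model: the caller's image is hashed once, when the channel opens."""

    def __init__(self, image: bytes, expected_digest: str):
        self.hash_count = 0
        # An immutable copy stands in for pages that are write-protected
        # for the lifetime of the channel (assumption of this sketch).
        self._image = bytes(image)
        self.is_open = self._measure() == expected_digest

    def _measure(self) -> str:
        self.hash_count += 1
        return hashlib.sha256(self._image).hexdigest()

    def send(self, message: bytes) -> bool:
        # No re-measurement per message: the image cannot change while
        # the channel is open, so the initial measurement still holds.
        return self.is_open


image = b"example application image"
channel = ProtectedChannel(image, hashlib.sha256(image).hexdigest())
accepted = [channel.send(b"request %d" % i) for i in range(3)]
```

Here three requests are accepted after a single hash computation; without write protection, each request would need its own measurement to give the same guarantee.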
3.3  System  Logical  Design


In  the  above  we  have  discussed  why  hardware/TXT  alone  is  not  sufficient  to  defeat  the  hit- 
and-stick  attack  against  digital  signing.  The  cause  of  this  phenomenon  is  that  the  attacker 
can  penetrate  into  the  OS  of  the  victim  computer  and  thus  impersonate  the  user  or  user 
program.  The  ideal  solution  to  this  problem  is  to  ensure  that  the  OS  is  never  penetrated, 
which  is  however  a  grand  challenge  that  might  remain  open  for  decades.  For  a  practical  and 
feasible  solution,  we  would  have  to  make  some  assumption  that  there  is  some  small  Trusted 
Computing  Base  (TCB)  in  the  software  stack.  This  leads  to  the  architectural  framework 
depicted  in  Figure  3.1,  where  the  small  TCB  is  naturally  realized  by  the  hypervisor.  We 
assume  the  hypervisor  is  secure,  which  is  an  active  research  topic  [66,  71,  5]. 

Capturing  dynamic  system  properties  is  an  important  yet  challenging  research  problem 
that remains to be tackled [41, 76]. Our approach is orthogonal to those efforts because we
can take advantage of them in a plug-and-play fashion. This also applies to research that
aims  to  ensure  kernel  integrity  and  detect  kernel  rootkits  [47,  16]. 
Figure  3.1:  Logical  design  of  solution  framework  to  the  hit-and-stick  problem  (dashed  arrows
represent  logical,  rather  than  physical,  communication  flows) 


In this framework, a signature verifier verifies not only the cryptographic validity of a
digital signature (i.e., that the signature is valid with respect to the claimed public key,
which has not been revoked), but also the attestation about the system environment in which
the signature was generated. Because we want to prevent attacks against the signer’s computer,
rather than just detect them after the fact, the attestation ideally should include state
information such as whether the system (especially the application program that requested the
signature) is under attack or suspicious. Correspondingly, the signer’s system, which needs
to collect the relevant information, is characterized as follows. We separate the applications
from the signing server because we want to make our solution extensible so as to integrate
with other existing and to-be-developed solutions. We note that it is relatively easier to
protect the cryptographic server than to protect the cryptographic service requester because
the former is almost always static, whereas the latter resides in a system that often needs
to be updated with new software programs or their patches. This justifies why we use
a trusted VM for the actual signing program, while the application runs in an untrusted
VM. This allows integration with existing and future VM-based introspection solutions
(such as those mentioned above) for a more comprehensive solution. Moreover, we use the
protected monitor within the user’s untrusted VM, which could be integrated with other
in-VM introspection security mechanisms. This is appealing because in-VM introspection has
certain advantages over out-of-VM introspection. The protected monitor is not designed to
be a part of the TCB because we want to make as few changes to the TCB as possible; instead,
it is a security-critical module that resides directly on the TCB. Moreover,
the protected monitor can integrate existing countermeasures against the compromise of the
software making requests.
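
The verifier's decision described above can be sketched as a simple conjunction of checks. This is a hedged illustration only: the record fields are hypothetical, the text does not fix an attestation format at this point, and the cryptographic verification itself is assumed to have been done elsewhere and is represented here by a boolean.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Attestation:
    # Hypothetical fields; the actual attestation contents are not specified here.
    requester_measurement_ok: bool   # requesting application matched its expected hash
    system_flagged_suspicious: bool  # signer's system reported attack or suspicion


@dataclass(frozen=True)
class SignedItem:
    signature_valid: bool            # outcome of ordinary signature verification
    key_revoked: bool                # outcome of a revocation check
    attestation: Optional[Attestation]


def verifier_accepts(item: SignedItem) -> bool:
    """Accept only if both the cryptographic and the systems-based checks pass."""
    if not item.signature_valid or item.key_revoked:
        return False
    if item.attestation is None:
        return False
    return (item.attestation.requester_measurement_ok
            and not item.attestation.system_flagged_suspicious)
```

The point of the sketch is that a cryptographically valid signature is necessary but no longer sufficient: a valid signature produced while the requester was flagged as suspicious is rejected.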

3.4  System  Physical  Design 

Figure 3.2 depicts the overall physical design of our signer system. We choose Xen as our
platform because the code is freely available. The physical design details many issues that
were abstracted away in the logical design above. In particular, the trusted VM
is the Xen domain 0 (sometimes referred to with the shorthand dom0), and the untrusted
VM is a Xen domain U (sometimes referred to with the shorthand domU). In what follows
we elaborate on the main components in the system.


Figure  3.2:  Overall  software  architecture  of  the  system 


3.4.1  System  Components  in  Xen 

The mechanisms for which our system needs support from the hypervisor are memory
protection, hypercalls, and call gates. We elaborate on each below.

Memory protection. Xen-enabled memory protection is a key component of our security,
because it allows us to protect data and code from modification by an attacker in domain
U. Most importantly, we need to provide memory protection for the protected monitor itself
so that it cannot be compromised. We protect 4 megabytes
of memory in domain U, and use each megabyte for a separate purpose. The individual megabytes
are designated M0, M1, M2, and M3, respectively.
A trusted VM runs a backend monitor that can persist information. (Note that hypervisors
typically have no ability to persist data, and the file store in the user VM cannot
necessarily be trusted.) Highly secure code that authors do not wish to port to run within
the monitors themselves can also be run within the trusted VM, allowing it to make use
of ordinary operating system services.

We modify the Xen hypervisor to add memory protection for the protected monitor that
sits within the user VM, and to allow inter-VM communication without
having to rely on the user VM kernel and operating system in any way. The protected monitor
within the user VM can handle simple access control decisions without having to cross the
VM boundary. More complex decisions, including decisions that are best made outside the
VM, are sent to the backend monitor in the secure VM, which is Xen’s domain 0.

User  applications  are  run  inside  the  user  VM,  where  the  protected  monitor  has  been 
inserted  above  the  kernel.  Our  protected  monitor  can  be  seen  as  superior  to  the  kernel, 
meaning  that  the  protected  monitor  is  not  only  difficult  to  attack,  but  could  be  used  to 
mediate  kernel  actions  if  desired.  The  protection  is  achieved  by  using  virtual  machine  page 
protections  to  protect  a  region  of  kernel  memory  where  our  protected  monitor  will  reside. 
This  memory  is  protected  against  execution  and  modification,  except  during  a  special  mode 
that  only  applies  when  the  monitor  is  executing. 

Hypercalls. We extend Xen’s hypercall mechanism with six additional hypercalls to
support our system design. Two are invoked by domain 0 only: one to send information about
the shared memory, and one to set the GDT. One is invoked by domain 0 or domain U, to send
a Xen virtual IRQ (VIRQ). One is invoked only by the domain U kernel module, to map the
entire 4 MB of shared memory at once. One is invoked in domain U inside the call gate 3 code,
which allows it to perform privileged operations: mapping and unmapping M2 and M3, and
sending a message to domain 0. Lastly, one is invoked by the exit page in domain U, to restore
the page protections to their state before the protected monitor (PM) was invoked. These will
be further explained as we go through the system operational sequence.
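
The six added hypercalls and their intended invokers can be summarized as a small table. The sketch below is illustrative only: the hypercall names are hypothetical labels of ours (the text does not name them), and whether and how Xen enforces these caller restrictions is not stated here.

```python
from enum import Enum, auto


class Caller(Enum):
    DOM0 = auto()
    DOMU_KERNEL_MODULE = auto()
    DOMU_CALL_GATE = auto()
    DOMU_EXIT_PAGE = auto()


# Hypothetical names for the six added hypercalls, mapped to the callers
# that the text says invoke them.
ADDED_HYPERCALLS = {
    "send_shared_memory_info": {Caller.DOM0},
    "set_gdt": {Caller.DOM0},
    "send_virq": set(Caller),  # "domain 0 or domain U"
    "map_shared_memory": {Caller.DOMU_KERNEL_MODULE},
    "pm_privileged_op": {Caller.DOMU_CALL_GATE},      # map/unmap M2, M3; message dom0
    "restore_page_protections": {Caller.DOMU_EXIT_PAGE},
}


def is_expected_caller(hypercall: str, caller: Caller) -> bool:
    """Return True if `caller` is one of the invokers listed for `hypercall`."""
    return caller in ADDED_HYPERCALLS[hypercall]
```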
Call gate. We utilize the call gate mechanism provided by x86 hardware in order to escalate
privilege from ring 3 (user mode applications in domU) to ring 1 (the ring level for the
domU kernel and our protected monitor). The unique feature of the call gate mechanism
is that we can raise the privilege level without modifying or using the kernel or its data
structures. Normally the kernel would control access to the Global Descriptor Table (GDT)
(which specifies call gates and other system descriptors), but in a Xen system this access is
controlled by Xen. We modify the Xen code and tables that initialize this table. All of our
call gates are set to execute code in M0.

In  theory  privilege  escalation  could  be  achieved  using  system  calls  or  hypercalls,  which 
we  use  in  other  places  and  are  more  typical  mechanisms  for  escalation.  Using  call  gates 
rather  than  system  calls  or  hypercalls  has  the  following  advantages: 

•  Call gates allow us to hash the domU application and know we have the correct
process, since we read the CR3 directly from the registers the process was using.

•  Call  gates  prevent  modification  of  the  CR3  or  page  table  by  the  attacker,  since  we 
know  the  CR3  and  page  table  the  process  was  actually  using  when  it  invoked  the  gate. 

•  Call  gates  allow  the  application  to  know  it  will  actually  invoke  Xen  (because  Xen 
controls  access  to  the  GDT),  rather  than  some  program  in  domain  U  pretending  to  be 
the  hypervisor. 

•  Communication via call gates ensures the kernel cannot selectively block messages. (That
is, if messages were sent via a kernel module, the kernel could selectively block some of them.)
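
For readers unfamiliar with call gates, the following sketch packs a generic 32-bit x86 call gate descriptor as it would appear in a GDT slot. It illustrates the standard hardware format only, not the modified Xen tables; the target offset and segment selector used in the example are hypothetical placeholders.

```python
import struct

CALL_GATE_TYPE_32 = 0xC  # descriptor type field for a 32-bit call gate


def make_call_gate(offset: int, selector: int, dpl: int, param_count: int = 0) -> bytes:
    """Pack an 8-byte 32-bit call gate descriptor (little-endian)."""
    # Access byte: present bit (0x80), gate DPL in bits 5-6, system type in bits 0-3.
    access = 0x80 | ((dpl & 0x3) << 5) | CALL_GATE_TYPE_32
    return struct.pack(
        "<HHBBH",
        offset & 0xFFFF,          # target offset, bits 15..0
        selector & 0xFFFF,        # target code segment selector
        param_count & 0x1F,       # number of stack parameters to copy
        access,
        (offset >> 16) & 0xFFFF,  # target offset, bits 31..16
    )


# Hypothetical example: a gate callable from ring 3 (gate DPL = 3) whose target
# selector carries RPL 1, i.e. it names a ring 1 code segment.
gate = make_call_gate(offset=0x00123456, selector=(3 << 3) | 1, dpl=3)
```

Setting the gate's DPL to 3 is what lets ring 3 code use it, while the privilege level of the target code segment determines the ring in which the called code runs.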

3.4.2  System  Components  in  Domain  U 

The main system component in the user domain is a new substrate we call the protected
monitor. The function of the protected monitor, which is a core part of the system, is to allow
userspace domain U applications to communicate directly and securely with domain 0. The
main issue encountered in the design of the protected monitor is to enable communication
between domain U and domain 0 without the support of the kernel.

Stub.  The  stub  layer  automatically  marshals  and  demarshals  cryptographic  library  calls 
and  forwards  the  calls  to  domain  0,  providing  transparent  access  to  the  service  provider  in 
domain  0. 

The stub exists to allow the user application to transparently invoke what appear to
be ordinary library calls. However, instead of processing a request inside the local
library, the stub automatically translates it into a request that travels via the protected
monitor to be served by the service provider in domain 0. The stub code declares the functions
of the cryptography library so that the user code can link against them just as it would link
against a static or dynamically-linked implementation of the cryptography library. The
definitions of these functions accept the library arguments, marshal them appropriately, and
send them to domain 0, which processes the request; the stubs then deserialize the reply. As a
result, the user application is completely unaware that the operations are not implemented
directly in the library.
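
The stub pattern described above can be sketched as follows. This is a minimal illustration under loud assumptions: the JSON wire format, the function-number table, and the in-process `fake_domain0` transport are all hypothetical, and the "signature" returned is just a SHA-256 digest used as a placeholder, not a real digital signature.

```python
import hashlib
import json

FUNCTION_IDS = {"sign": 1}  # hypothetical function-number table


def marshal_call(name: str, *args: bytes) -> bytes:
    """Serialize a function number and its byte-string arguments."""
    return json.dumps({"fn": FUNCTION_IDS[name],
                       "args": [a.hex() for a in args]}).encode()


def fake_domain0(request: bytes) -> bytes:
    """Stand-in for the service provider: unmarshal, 'sign', marshal the reply."""
    call = json.loads(request)
    assert call["fn"] == FUNCTION_IDS["sign"]
    message = bytes.fromhex(call["args"][0])
    placeholder = hashlib.sha256(message).digest()  # NOT a real signature
    return json.dumps({"result": placeholder.hex()}).encode()


def make_stub(name: str, transport):
    """Build a function that looks local but forwards the call over `transport`."""
    def stub(*args: bytes) -> bytes:
        reply = transport(marshal_call(name, *args))
        return bytes.fromhex(json.loads(reply)["result"])
    return stub


sign = make_stub("sign", fake_domain0)
result = sign(b"message to sign")
```

The caller simply invokes `sign(...)` as if it were a library function; in the real system the transport would be the protected monitor path to domain 0 rather than a local function.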

Kernel module. The kernel module enables user processes to invoke certain hypercalls,
since user processes cannot invoke hypercalls directly. Invoking the kernel module is also
faster than invoking a call gate, so it is best not to use gates for everything. The downside
of using a kernel module is that a compromised kernel could prevent it from operating. For
this reason we never use the kernel module for security-sensitive operations, only to set up
and tear down the system. If those operations fail, the result is merely a denial of service.

Protected Monitor. When an application process in domain U requires a cryptographic
service, it invokes the cryptographic service provider stub. The stub uses a call gate to
invoke an appropriately marshalled hypercall (including the identity of the specific function
that is requested) so as to send a Xen event across a channel to the secure VM. Note
that some privilege escalation must be done by Xen hypercalls rather than by call gates or
by invoking the kernel module, because hypercalls are the only way to communicate with the
hypervisor. Note also that, due to the design of Xen (and other typical hypervisors), hypercalls
can only be invoked directly from code in the kernel or a kernel module, so we could not
implement communication from userspace securely and efficiently using only hypercalls.

3.4.3  System  Components  in  Domain  0 

The system components in the trusted VM include: (i) backend monitor, (ii) remote
attestation service, (iii) crypto service, (iv) disk. Below we describe the components in detail.

Backend monitor. This is the counterpart, inside the trusted VM, to the protected monitor.
It has three major functions: facilitating communication (see Section 3.4.4), determining which
communication requests to approve or deny (the policy engine), and inspecting the domain
U caller generating a request. The most complex of these is inspecting
the domain U caller. We describe each in turn:

• Facilitating communication. Upon receiving the event, Xen maps in the memory pages that
were transferred to the secure VM in order to read the marshalled function number
and arguments. A stub layer for the cryptographic service provider recreates the
actual C language invocation from that data. The backend monitor in domain 0 receives
the VIRQs (virtual interrupts) and translates them into appropriate user-level library invocations, which
requires unmarshalling the arguments.

•  Policy  engine.  The  goal  of  the  policy  engine  is  to  allow  the  creation  of  flexible  policies 
for  approving  and  denying  requests  made  via  the  backend  monitor,  based  on  decision 
criteria available to the backend monitor, such as whether hash values match. This is
relatively straightforward from a coding perspective, so we implemented only
some simple checks.

•  Inspecting  the  domain  U  caller.  In  order  to  establish  the  authenticity  and  integrity  of 
an  executing  program  that  claims  the  right  to  use  a  certain  key,  we  need  to  authenticate 
the caller. A typical solution for such a problem would be to compute a hash of the
executable image in memory. This presents two problems: how do we know what the
hash  of  an  executable  should  be,  and  how  do  we  deal  with  the  need  for  updates  to  an 
executable,  which  will  change  its  hash?  For  a  program  that  does  not  ever  change  the 
answer  is  simple  enough:  if  the  hash  of  the  program  that  requested  generation  of  the 
key  is  the  same  as  the  hash  of  the  program  that  requested  use  of  the  key,  then  the 
use  should  be  permitted.  For  the  more  common  case  of  a  program  whose  executable  is 
periodically  updated,  a  more  sophisticated  mechanism  is  required.  Here  we  introduce 
the concept of the provenance of an executable. By this we mean a trail
that shows how an executable was obtained or from what source it originated.
When  an  application  initially  creates  a  key,  we  compute  a  hash  of  the  executable  and 
check  it  against  a  signature  provided  by  the  publisher.  If  the  signature  matches,  we 
then  record  the  publisher  as  having  the  right  to  produce  future  applications  of  the 
same  name  that  can  use  this  key.  Neither  applications  from  other  publishers  nor  other 
applications  from  this  publisher  have  the  right  to  use  the  key. 

It is important to note that we hash the executable at the first call gate, and then
lock the executable pages so they cannot be modified. This is for two reasons: (i) the
performance impact is lower, since the hash is computed once at the beginning of communication
instead of every time a message is sent; (ii) this prevents subtle TOCTOU (time-of-check
to time-of-use) attacks which would otherwise be possible (e.g., changing the binary
just before sending the message, then somehow changing it back afterwards).

Some  technical  issues  need  to  be  resolved  in  order  to  compute  this  hash.  First,  we 
need  to  know  what  comprises  the  executable,  while  avoiding  any  dependency  on  the 
kernel  as  far  as  possible.  Second,  we  wish  to  perform  this  operation  efficiently  since 
a  process  could  have  a  large  set  of  pages.  After  evaluating  alternatives,  we  chose  to 
examine  the  pages  in  the  user  process  code  segment.  This  gives  us  the  executable  and 
all  libraries,  including  shared  libraries,  while  avoiding  any  reliance  on  the  kernel  or  its 
data  structures  and  still  giving  better  performance  than  other  options. 
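
The provenance rule described above can be sketched as a small decision function. This is only an illustration under stated assumptions: the record layout, the names key_owner, caller and may_use_key, and the 20-byte hash length are hypothetical, and verification of the publisher’s signature is assumed to happen elsewhere and is passed in as a boolean.

```c
#include <stdbool.h>
#include <string.h>

#define HASH_LEN 20   /* assumed hash size; the actual hash function is not specified here */

/* Hypothetical record stored when an application first creates a key. */
struct key_owner {
    unsigned char exe_hash[HASH_LEN];  /* hash of the creating executable */
    char publisher[64];                /* publisher whose signature matched */
    char app_name[64];                 /* name of the creating application */
};

/* Hypothetical description of the process now asking to use the key. */
struct caller {
    unsigned char exe_hash[HASH_LEN];  /* hash measured at the first call gate */
    const char *publisher;
    const char *app_name;
    bool publisher_sig_ok;             /* publisher signature over exe_hash verified elsewhere */
};

/* Unchanged executable: same hash as at key creation, so use is permitted.
 * Updated executable: permitted only with a valid signature from the same
 * publisher and the same application name. */
bool may_use_key(const struct key_owner *k, const struct caller *c)
{
    if (memcmp(k->exe_hash, c->exe_hash, HASH_LEN) == 0)
        return true;
    if (!c->publisher_sig_ok)
        return false;
    return strcmp(k->publisher, c->publisher) == 0 &&
           strcmp(k->app_name, c->app_name) == 0;
}
```

Under this sketch, a signed update of the same application from the same publisher is accepted, while a different application from the same publisher, or a same-named application from another publisher, is refused.
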

Crypto service. This component provides the cryptography service. It consists of two
pieces:  the  cryptography  library  itself  and  a  wrapper  which  enables  the  library  to  receive 
calls  made  across  the  VM  boundary. 

Disk.  This  is  simply  the  ordinary  disk  in  domain  0.  Note  that  there  is  no  way  for  the 
domain  U  to  access  the  domain  0  disk,  so  any  information  on  the  domain  0  disk  is  secure 
from  the  domain  U. 

The  disk  is  important  because  it  stores  the  keys  used  by  the  crypto  service,  as  well  as 
the  implementation  of  that  service  and  all  other  domain  0  software  components. 

Attestation service. Attestation criteria in our concrete implementation include the following:
(i) static measurement of boot and kernel (using TPM); (ii) using a secured cryptography
library; (iii) authentication of the requesting program (measuring the binary and
libraries); (iv) trusted path user confirmation dialog.

Figure  3.3  depicts  the  optional  trusted  path  user  confirmation  dialog.  This  runs  from 
domain  0  so  that  it  displays  directly  on  the  console  using  X  Windows  and  enables  the  user  to 
explicitly  approve  each  signature  request  made  with  their  key.  While  our  prototype  simply 
records the bytes of the message and shows the corresponding file type, a full implementation
could  feed  the  bytes  to  a  document  viewer  so  that  the  user  could  see  the  actual  document 
being  signed  (if  it  is  of  a  type  that  is  a  viewable  document).  Because  this  dialog  is  part  of 
the  domain  0  service  provider  code,  its  operation  is  completely  transparent  to  domain  U, 
which  is  completely  unaware  of  its  existence,  except  for  two  changes  in  the  behavior  of  the 
signature  request  call:  1.  the  call  does  not  return  until  the  user  indicates  their  decision,  and 
2.  well-formed  signature  requests  will  fail  if  the  user  disapproves  the  request. 

Trousers.  Trousers  is  an  open-source  implementation  of  the  TCG  Software  Stack  (TSS), 
created  and  released  by  IBM.  This  enables  domain  U  applications  to  access  the  TPM  using 
the software API designed by the Trusted Computing Group.

[Figure 3.3 (dialog window titled "Allow Signature?"): "Signature with private key requested. Key name: Parker private signing key 1. Document hash: 2b2d2d90445a73e34bcdd290eb1537ea20ef7383. Document file type: ASCII text. Do you wish to allow this signature?" Buttons: Deny, Allow.]

Figure 3.3: User signature confirmation dialog (optional)


3.4.4  Putting  the  Pieces  Together 

Shared  memory  and  communication  flow.  Shared  memory  is  the  mechanism  we  use  for 
efficient  communication  between  the  trusted  VM  and  the  user  VM.  By  mapping  the  same 
pages into both VMs, messages can be sent from one domain to the other without any copy
operation,  making  message  transmission  a  fixed  cost  irrespective  of  message  size,  which  is 
important  since  users  may  request  signatures  on  large  amounts  of  data. 


For the protected monitor we allocate 1024 physical pages of memory (4 MB). The first
256 physical pages (M0) contain the wrapper function that is used to invoke the hypercall
asking the trusted VM to pass a request to the crypto service. The second 256 physical pages (M1) are
used to store parameters and the measurements of the user VM application’s code segment,
the user VM system call table, and the IDT (Interrupt Descriptor Table). The third 256
physical pages (M2) are where the user VM application writes the message that needs to be signed. The
last 256 physical pages (M3) are used by the trusted VM to return a result to the untrusted
VM.

Three sets of virtual pages map to these physical pages: (i) in the user VM’s kernel
space, the protected monitor maps M0 and M1; (ii) in the user VM’s user space, the user
application maps M2 and M3; (iii) in the trusted VM’s user space, the crypto service maps
M1, M2 and M3. Because (i) and (ii) are both in the user VM, the page table entries for these
virtual pages need to be protected by memory protection in Xen.
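
The layout just described can be summarized in a few lines of C. This is a sketch for orientation only: the 4096-byte page size is an assumption consistent with 1024 pages totalling 4 MB, and all identifiers (SHM_PAGE_SIZE, region_offset, maps_region, and so on) are hypothetical names introduced here.

```c
#include <stddef.h>

#define SHM_PAGE_SIZE  4096u  /* assumed: 1024 pages * 4096 bytes = 4 MB */
#define REGION_PAGES   256u   /* each of M0..M3 spans 256 pages */
#define NUM_REGIONS    4u

enum region { M0 = 0, M1 = 1, M2 = 2, M3 = 3 };

/* The three parties that map parts of the shared memory. */
enum mapper {
    PROTECTED_MONITOR_DOMU_KERNEL,  /* user VM, kernel space */
    USER_APP_DOMU,                  /* user VM, user space */
    CRYPTO_SERVICE_DOM0             /* trusted VM, user space */
};

/* Byte offset of a region from the start of the shared memory. */
size_t region_offset(enum region r)
{
    return (size_t)r * REGION_PAGES * SHM_PAGE_SIZE;
}

/* Total size of the shared memory in bytes. */
size_t total_bytes(void)
{
    return (size_t)NUM_REGIONS * REGION_PAGES * SHM_PAGE_SIZE;
}

/* Region that contains a given page index (valid indices are 0..1023). */
enum region region_of_page(unsigned page)
{
    return (enum region)(page / REGION_PAGES);
}

/* Which party maps which region, following items (i) to (iii) in the text. */
int maps_region(enum mapper who, enum region r)
{
    switch (who) {
    case PROTECTED_MONITOR_DOMU_KERNEL: return r == M0 || r == M1;
    case USER_APP_DOMU:                 return r == M2 || r == M3;
    case CRYPTO_SERVICE_DOM0:           return r == M1 || r == M2 || r == M3;
    }
    return 0;
}
```

With these definitions each region is 1 MB, M2 starts 2 MB into the shared memory, and only the crypto service sees both the measurements in M1 and the message in M2.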

Recall that our design goal is to require no code changes for the user applications, so
we simply relink each application against a stub library, which is particularly easy if the application is
dynamically linked. This stub library must provide a communication layer in which inter-VM
communication is completely hidden from the ordinary user-space application, which is still
written as if it is invoking the CSP (Cryptographic Service Provider) as a local library. In
order to achieve secure kernel-free communication without making any modifications to the
domain U OS kernel, we need to realize privilege escalation as follows.

Figure  3.5  summarizes  the  steps  in  system  execution,  with  emphasis  on  message  flow 
between  entities.  During  the  preparatory  step,  kernel  modules  are  loaded  in  domain  0  and 
domain  U.  It  is  important  to  note  that  the  domain  U  kernel  module  is  not  used  to  implement 
any  security-sensitive  functionality.  If  the  domain  U  kernel  blocked  it  or  blocked  some  of  its 
functionality,  it  would  be  able  to  achieve  only  a  denial-of-service  attack.  All  interrupts  are 
sent and received through ioctl operations on the device files that are the interface to the
kernel modules. Here are the actual system steps as executed under the direction of user
land applications in domain 0 and domain U:

Figure 3.5: A high-level description of the system control flow


1.  The  kernel  module  devices  are  opened,  which  causes  them  to  register  themselves  to 
handle  certain  virtual  interrupts  (software  interrupts  generated  by  Xen). 

2.  Domain  U  uses  the  kernel  module  to  request  that  the  hypervisor  send  an  interrupt  to 
domain 0. This interrupt, IRQ1, is used to signify that a client is starting up.

3.  When  the  domain  0  application  receives  this  interrupt,  it  allocates  4  megabytes  of 
memory. 

4. The domain 0 application then creates the shared memory and uses two pages to store
the references to the 4 megabytes of shared memory and the MFNs (machine frame numbers). Each page of
shared memory has a reference, so 1024 pages of shared memory have 1024 references; in step 5 our new
hypercall uses these references to map the 4 megabytes, and uses the MFNs
to protect the domain U mapping of the shared memory. The domain 0 application then uses
a hypercall to send the location of the shared memory and the pages of references and
MFNs to Xen. This allows domain U to use the new hypercall to map this memory.

5. After waiting on IRQ1, domain U invokes a special hypercall to map the 4 megabytes into
the kernel address space for the domain U application. (Mapping the memory into its
address space is what allows domain U to “share” this memory with domain 0.) This
hypercall is different from the one ordinarily used to share memory in Xen, because
of our special security requirements. This memory is read-only and “map-protected,”
which prevents it from being mapped using normal Xen sharing hypercalls. This is
discussed in more detail in Section 3.6. Domain U then indicates that it has finished
mapping the shared memory by sending IRQ2.

6.  When  domain  0  receives  IRQ2,  it  modifies  the  GDT  (the  x86  Global  Descriptor  Table) 
to  install  the  call  gates  for  use  in  domain  U.  It  then  installs  the  wrapper  function  in 
M0, where it will be invoked by domain 0, and exit_page in the last page of M1. The
last page of M1 is always write-protected, so that domain U cannot write it. M0 cannot
be written from domain U normally, but becomes writable while domain U is inside
call  gate  2.  Domain  0  then  sends  IRQ2  to  domain  U  to  indicate  that  the  call  gates  are 
set  up. 

7.  When  domain  U  receives  IRQ2,  it  can  then  invoke  the  first  call  gate.  The  effect  of  the 
call  gate  is  to  raise  the  CPU  privilege  level  to  ring  1  from  the  user-level  of  ring  3  and 
begin  executing  code  at  a  specified  location.  At  the  same  time,  the  CR3  and  page  table 
in  use  do  not  change,  so  hypercalls  can  be  made  directly  from  the  user  application  and 
operate  on  the  page  table  of  the  user  application. 

The M0 code for call gate 1 simply invokes a hypercall (which is otherwise not possible
without going through the kernel). This hypercall is used to map the M2 and M3 memory
into the user process. M3 is set read-only and M2 can be read and written.

The Xen hypervisor then computes a hash of the domain U application that invoked the call
gate. Because the call gate invocation retained the CR3 and page table
of the process, this uses the page tables of the process, which prevents various attacks
that try to substitute different code when a process is being hashed. By using the
page table of the process, we can ensure these are the same pages the process would
actually access and execute. The executable pages of the application are then marked
read-only (if they were not already) and Xen is instructed to protect the page table
entries (PTEs) of the executable pages, so that the kernel cannot modify the page table
to point at different pages. That is, the pages themselves cannot be changed, and the VM
subsystem “pointers” to the pages cannot be changed.

As  soon  as  the  call  gate  returns,  the  user  application  can  place  a  message  in  M2 
whenever  it  desires. 

8. In the meantime domain 0 maps M1-M3 into user space. This allows user space code
in  domain  0  to  access  the  hash  of  the  domain  U  system  call  table,  IDT,  and  userapp 
executable  pages  in  domain  U,  as  well  as  to  access  the  value  of  the  two  parameters 
from  domain  U. 

9.  The  user  application  copies  its  message  into  M2. 

10. Domain U then invokes the second call gate, which requests that the message in M2 be sent.
Invoking this call gate runs the code in shared memory, which has two major steps:

(a)  First,  a  single  Xen  hypercall  is  made,  and  then: 

(i) M2 is marked as not writable. This is a second way to ensure that domain
U cannot interfere with the message being sent; we had already ensured that the
kernel could not write it and that it was only accessible by the process. Note
this is a second defense because there is potential for a race condition. The
race condition occurs if the process receives a software interrupt that causes
some malicious code inside the process to execute after writing the message but
before invoking the call gate. In this case we would have detected the corrupted
executable when we measured the executable. For extra protection the messaging
code could temporarily disable software interrupts (e.g., with sigprocmask()).

(ii) M0-M1 is marked as writable (except the exit_page, the last page in M1, which
remains executable but not writable).

(iii) The process hash, a, and b are recorded in M1 (the hashes of the domain U system
call table and IDT are also recorded in M1).

(iv) The call gate code then sets a shared variable flag1 to inform domain 0 that there
is a message waiting to be processed (see footnote 3).

(v) Domain U then polls, waiting for another shared variable flag2, which signifies that the
message has been processed and a reply is available.

(b) Second, we execute exit_page. This transitions us back to user mode after invoking
a hypercall that makes M0-M1 read-only.

11. When domain 0 observes that flag1 is set, it knows there is a message available, so it
reads it from M2. When a reply is ready, it places the reply in M3 and sets flag2 to
let domain U know the message has been processed and a reply is available.

12. When domain U observes that flag2 is set, it executes a hypercall to make M2 writable
again in case it wants to send another message. It reads the reply from M3.

13.  If  domain  U  wishes  to  send  another  message,  it  returns  to  step  10.  Note  that  domain  U 
needs  to  send  a  termination  message,  because  domain  0  has  no  way  to  know  otherwise 
when  the  connection  should  be  torn  down. 

3  Our  original  system  design  used  an  interrupt  here  rather  than  polling.  However,  occasionally  the 
implementation  with  the  interrupt  will  pause  for  a  few  seconds  before  continuing.  So  although  work  on  the 
interrupt-based  implementation  is  ongoing,  we  chose  to  present  the  polling  implementation  in  this  thesis 
rather than rely on the assumption that the pause issue will be fixed.

14. When domain U has finished with all messages it wants to send, it invokes call gate
3. This unmaps M2-M3 and sends IRQ3 to domain 0, to inform it that it is no longer
attached to the shared memory. Domain U then unmaps M0-M3 and closes the device
file that connects it to the kernel.

15. When domain 0 receives IRQ3, it unmaps M2-M3, closes the device file that is connected
to the kernel module, and destroys the shared memory.
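
The polling exchange in steps 9 through 12 can be sketched as a small single-threaded model. This is an illustration under several assumptions, not the thesis code: the struct and function names are hypothetical, the hypervisor-enforced page permissions are reduced to a boolean, the "service" merely prefixes the message instead of signing it, and the points at which the flags are cleared are our choice. A real implementation on memory shared across VMs would also need appropriate memory barriers.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MSG_MAX 256

/* Hypothetical model of the shared state: M2 (request), M3 (reply), two flags. */
struct channel {
    char m2[MSG_MAX];
    char m3[MSG_MAX];
    bool m2_writable;  /* stands in for the write permission Xen enforces on M2 */
    int flag1;         /* set by domain U: a message is waiting */
    int flag2;         /* set by domain 0: a reply is available */
};

void channel_init(struct channel *c)
{
    memset(c, 0, sizeof *c);
    c->m2_writable = true;
}

/* Step 9: the application copies its message into M2 while M2 is writable. */
bool domu_write(struct channel *c, const char *msg)
{
    if (!c->m2_writable || strlen(msg) >= MSG_MAX)
        return false;
    strcpy(c->m2, msg);
    return true;
}

/* Step 10: call gate 2 freezes M2 and raises flag1. */
void domu_send(struct channel *c)
{
    c->m2_writable = false;
    c->flag2 = 0;
    c->flag1 = 1;
}

/* Step 11: one polling pass in domain 0; true if a request was serviced. */
bool dom0_poll(struct channel *c)
{
    if (!c->flag1)
        return false;
    strcpy(c->m3, "signed:");  /* placeholder for the real crypto service */
    strncat(c->m3, c->m2, MSG_MAX - strlen(c->m3) - 1);
    c->flag1 = 0;
    c->flag2 = 1;
    return true;
}

/* Step 12: one polling pass in domain U; on success M2 becomes writable again. */
bool domu_poll_reply(struct channel *c, char *out, size_t cap)
{
    if (cap == 0 || !c->flag2)
        return false;
    strncpy(out, c->m3, cap - 1);
    out[cap - 1] = '\0';
    c->flag2 = 0;
    c->m2_writable = true;
    return true;
}
```

In this model, a write attempted between domu_send and the arrival of the reply fails, which mirrors the requirement that the message cannot be altered once call gate 2 has been invoked.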

3.5  Implementation 

Our  system  was  implemented  in  the  following  environment.  The  hypervisor  is  Xen  3.3.1. 
Domain  0  runs  Ubuntu  8.04  (Linux  2.6.18.8-xen.hg  kernel)  as  its  guest  OS,  and  domain 
U  runs  Ubuntu  8.04  (Linux  2.6.18.8-xen.hg  kernel)  as  its  guest  OS.  For  the  digital  signing 
library, we use Peter Gutmann’s cryptlib library, which is available under both an open-source
license (Sleepycat, which is GPL-compatible) and a commercial license for closed-source
commercial use. cryptlib also provides certificate management services, including key
generation  in  response  to  certificate  requests. 

In order to safely and efficiently implement the runtime memory protection of the protected
monitor, we did the following:

• First, we ensure that the shared memory cannot be unmapped, remapped, or mapped
partially via hypercalls from domain U. To achieve this, after domain
0 shares the 4 megabytes, we fill the shared protection table with the references of
these 1024 pages. Then we set the flag shared_memory = 1 (domain 0 has already shared
the memory, so domain U cannot use the normal hypercall to map these memory
pages). When domain U wants to map the shared memory, we check in the function
gnttab_map_grant_ref whether shared_memory == 1 or shared_memory == 2
and whether the reference that domain U wants to map is in our shared protection table.
If it is in the table, we prevent domain U from mapping this page. So the shared memory cannot be
mapped partially via hypercalls from domain U.

The only way to map these memory pages is to use our new hypercall. Our new hypercall
accepts only two inputs. One is the map or unmap flag: if the input equals
GNTTABOP_map_grant_ref, that means map the 4 MB a single time. The other input
is the domain U shared_pages_addr; this will be stored in Xen. Later we will modify
the GDT to point to this address. In this hypercall, if the variable
shared_memory == 1, then we begin the mapping process. Before each page we map,
we temporarily set shared_memory to 0 to let the normal map proceed, then
set the shared_memory flag to 2 to prevent the domain U kernel from using this hypercall to
map the memory to another virtual address. To prevent domain U from removing
the mapping using the normal unmap hypercall, we add a check in the Xen function
gnttab_unmap_grant_ref. If shared_memory == 2 and the reference domain U requests
to unmap is in our shared protection table, we reject the unmap request. So
the shared memory cannot be unmapped, nor unmapped partially, via hypercalls from
domain U.

There are two more things we need to take care of. One is write access: we
do not want the domain U kernel to write other things, such as attack code, into the
shared memory, so we need to make sure the shared memory is read-only. The other
is the NX bit. Our original implementation interfered with the use of the NX bit
during system development (32-bit PAE (Physical Address Extension) kernels use the
NX bit), but we were able to work around this, so we did not have to remove this important
defense. To achieve these two properties, we use *((unsigned long *)pl1e)
&= 0xfffffffd to change the page table entry so that the shared memory in domain
U is read-only, and use *(((unsigned long *)pl1e)+1) &= 0x7fffffff to change the entries for the
M0 pages (protected monitor) and the last page in M1 (exit page) so that these pages are
executable. (The first mask clears bit 1 of the entry, the read/write flag; the second clears
the top bit of the entry’s upper 32-bit word, which is the NX flag.) This allows the call gate
to jump to M0 (the protected monitor).

After the user application finishes sending all of its messages, it asks the domain
U kernel module to unmap the shared memory using our new hypercall with input
GNTTABOP_unmap_grant_ref. The hypercall checks that shared_memory == 2 to verify
that the 4 MB of shared memory can be unmapped.

• Second, we check that the parameters of the domain U hypercall request to map
the shared memory are correct. The ordinary Xen hypercall do_grant_table_op needs
domain U to give parameters including the id of the other domain, the references of
the shared memory pages, and the virtual address of the shared memory. Our call
requires only the virtual address of the shared memory in domain U (later we will modify
the GDT to point to this address). We obtain the references for the shared memory pages
from domain 0, in order to prevent domain U from using our hypercall to map
other memory that domain 0 shared into domain U. The other domain’s id is fixed to the
secure VM’s id, so that domain U cannot use this hypercall to map the memory of
some different domain. Our hypercall also checks whether this domain U is the one that was
specified for the server in the secure VM.

•  Third,  we  modify  the  GDT  to  make  the  callgate  point  to  the  correct  address.  First 
we  need  to  copy  the  protected  monitor  code  into  M0  in  domain  0.  Since  domain  0  is 
the  secure  VM,  the  protected  monitor  is  correct,  and  since  the  access  to  M0  in  domain 
U  is  readonly,  the  protected  monitor  code  cannot  be  changed.  We  also  copy  the  exit 
page code into the last page of M1.

After  this  we  use  a  hypercall  to  modify  the  GDT  because  only  Xen  can  modify  the  GDT. 
During  the  Xen  boot  we  added  GDT  entries  for  our  three  new  callgates.  However,  the 
shared  memory  address  in  domain  U  could  not  be  determined  during  boot.  So  here  we 
fill  the  GDT  entry  with  the  domain  U  shared  memory  address  now  that  it  is  available. 

• Fourth, we ensure that the kernel cannot modify page table entries that point to
the shared memory. This is difficult because the domain U kernel has three ways to
modify a page table entry: the hypercalls do_mmu_update and do_update_va_mapping, and
the page-fault handler ptwr_do_page_fault. Each way could be subject to two kinds of attacks: one is to
change the kernel’s own PTE to map a protected page (shared memory), and the other is to
change a protected PTE to map other memory or to change the write access bit. We
discuss how we protect against these in Section 3.6.

•  Fifth,  at  all  times  access  to  the  shared  memory  from  domain  U  kernel  space  is  protected 
against writes. We write-protect access to M2 for domain U userspace from the
beginning  of  call  gate  2  until  domain  U  receives  the  reply. 

• Sixth, we ensure that the kernel cannot modify M2. For example, the kernel cannot change
a message after it is written to M2 but before call gate 2 is invoked to send the message.

• Seventh, while protecting the domain U executable and the 4 MB of shared memory is necessary,
searching the list of 1024 pages each time to see whether a page is one we protect would
be very slow. So we use one bit in page->u.inuse.type_info (in Xen’s frame_table),
which we named PGT_entry_protected, to mark whether a page needs to be protected.
Each time, we merely need to check this bit, and if it is set we prevent
domain U from changing the page table entry.
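
The role of the shared_memory flag described in the first item of this list can be modelled as a three-state guard. The sketch below is a simplified stand-alone model, not Xen code: the struct, the helper names and the enum labels are hypothetical, the actual mapping work is omitted, and the state entered after a successful unmap is an assumption, since the text only states the precondition shared_memory == 2.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PROTECTED_REFS 1024

/* Values 0, 1 and 2 follow the text; the labels are ours. */
enum shm_state { SHM_UNSHARED = 0, SHM_SHARED = 1, SHM_MAPPED = 2 };

struct shm_guard {
    enum shm_state shared_memory;
    uint32_t table[PROTECTED_REFS];  /* the shared protection table */
    size_t count;
};

static bool in_table(const struct shm_guard *g, uint32_t ref)
{
    for (size_t i = 0; i < g->count; i++)
        if (g->table[i] == ref)
            return true;
    return false;
}

/* Models the check added to the normal map path (gnttab_map_grant_ref). */
bool normal_map_allowed(const struct shm_guard *g, uint32_t ref)
{
    bool guarded = g->shared_memory == SHM_SHARED ||
                   g->shared_memory == SHM_MAPPED;
    return !(guarded && in_table(g, ref));
}

/* Models the check added to the normal unmap path (gnttab_unmap_grant_ref). */
bool normal_unmap_allowed(const struct shm_guard *g, uint32_t ref)
{
    return !(g->shared_memory == SHM_MAPPED && in_table(g, ref));
}

/* Models the new hypercall with the map flag: allowed once, from state 1. */
bool special_map(struct shm_guard *g)
{
    if (g->shared_memory != SHM_SHARED)
        return false;
    for (size_t i = 0; i < g->count; i++) {
        g->shared_memory = SHM_UNSHARED;  /* let the normal map path through */
        bool ok = normal_map_allowed(g, g->table[i]);
        /* ... the real per-page mapping would happen here ... */
        g->shared_memory = SHM_MAPPED;    /* block any further mapping */
        if (!ok)
            return false;
    }
    g->shared_memory = SHM_MAPPED;
    return true;
}

/* Models the new hypercall with the unmap flag: requires state 2.
 * Returning to state 1 afterwards is an assumption made for this sketch. */
bool special_unmap(struct shm_guard *g)
{
    if (g->shared_memory != SHM_MAPPED)
        return false;
    g->shared_memory = SHM_SHARED;
    return true;
}
```

The point of the design is visible in the model: once domain 0 has shared the pages, the ordinary map path refuses any protected reference, the special call succeeds exactly once, and after it succeeds neither a second special map nor an ordinary unmap of a protected reference is accepted.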


3.6  Security  Analysis 

Here  we  analyze  the  security  of  a  system  designed  as  described  and  carefully  implemented. 

We  consider  hit-and-run  and  hit-and-stick  attacks  that  can  compromise  the  user  VM. 
There  are  two  basic  ways  to  attack  the  system:  (i)  attacking  the  protected  monitor;  (ii) 
attacking  the  crypto  service  via  attacks  against  cryptography,  or  attacks  against  key  secrecy, 
or  attacks  against  applications  that  request  digital  signatures,  or  attacks  that  falsely  request 
digital signatures. In what follows we argue why the attacks cannot succeed. After introducing
our threat model, we organize the analysis by the physical location of the components under attack: domain U,
domain 0, and components that are not contained within a specific domain.

3.6.1  Defeating  Attacks  against  Domain  U  Components 

Attack attempting to prevent installation of the protected monitor. We explain
why such an attack cannot prevent the protected monitor from being mapped into domain U
memory.

• An attacker in the kernel cannot intercept and fake the hypercall that maps the 4 MB of
memory into the kernel address space for domain U without being detected (the
system can then be cleaned up before installing the protected monitor). In this attack, the attacker
deliberately does not make the real call to map the memory. However, this
will be detected, because call gate 1 will report failure when it finds that the
special 4 MB was never mapped.

• An attacker cannot interfere with the 4 MB mapping by calling Xen hypercalls themselves,
for the following reasons. (i) Our modifications to Xen ensure that the attacker cannot
map it before we do and cannot map only part of that memory. The latter is achieved
because we store the shared memory’s MFNs in a page (step 4 of Figure 3.5), and after
domain 0 grants the 4 MB of memory Xen prevents domain U from using the hypercall
do_grant_table_op to map the shared memory pages. Although the attacker could
map the 4 MB using our call before we do, this just makes our mapping request redundant
and does not cause any security problem, because the call is idempotent. (ii) Our modifications
to Xen ensure that the attacker can neither unmap nor remap (any portion of) the
4 MB after we map it. This is because after using our hypercall the shared_memory
flag is changed to mapped, which prevents domain U from remapping the shared memory, so
that domain U cannot use our hypercall again to remap the shared memory to another
virtual address. Moreover, the hypercall do_grant_table_op cannot be used to map or
unmap part of that memory somewhere else.


• An attacker cannot fake the malloc() result, which is used in ensure_shared_memory().
Either malloc returns 2 MB of allocated memory in the process address space or it
does not. After the first call gate, our hypercall maps this memory to M2 (read-only after
the second call gate) and M3 (read-only), and then protects the page table entries of these
2 MB of memory. The attacker therefore cannot map it to his own memory or write to M3 to
modify the signatures.

• An attacker that has compromised the kernel cannot modify the kernel’s own page
table in order to access the shared memory directly. Since the kernel can only modify
page tables through Xen, even for the kernel’s own page table, we can use Xen to
prevent the kernel from modifying its own page table to access the shared memory.
We use one bit in page->u.inuse.type_info (in Xen’s frame_table), which we named the
PGT_entry_protected flag, to mark the pages that need protection. More specifically,
there are three kinds of page table entries (PTEs) that need protection from modification:

1. Domain U kernel space mapping of M0-M1.

2. The domain U user application mapping of M2-M3 (M3 is always read-only, and M2
is read-only from the beginning of the second call gate until the return from the call
gate).

3. The domain U user application’s own executable pages (these need to be
protected to prevent TOCTOU attacks that change the executable after we first
measure it, so that we need to measure it only during the first call gate).

The attacker (i.e., a compromised kernel) has two ways to attack these three kinds of PTE’s. First, the attacker could try to use his own page table entries to map to M0-M3 or to the user application’s executable pages, which seems possible because the kernel can set some page table entry with write access to it. However, the attacker cannot map his virtual address to M0-M3 in domain U, because these pages are owned by dom0, so M0-M3 cannot be attacked in this way. We also marked the domain U user application’s executable pages as protected, so an attacker who tries to map his own page table entries to the user application’s pages will be detected by Xen, which will then prevent the attack from succeeding. Second, the attacker can try to modify the PTE’s of domain U’s M0-M3 or of the application’s executable pages so that the PTE’s map to the attacker’s own pages. If this attack succeeded, domain 0 might help the attacker sign a wrong message. We prevent this kind of attack as well: we mark these page table entries when domain U maps the shared memory, so that they cannot be modified by the attacker.
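The two attack paths and the corresponding checks can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the prototype’s Xen code: the class ToyHypervisor, its methods, and the single-bit constant are hypothetical stand-ins, and the flag word only loosely models page->u.inuse.type_info.

```python
from dataclasses import dataclass

# Hypothetical stand-in for the PGT_entry_protected bit described in the text.
PGT_ENTRY_PROTECTED = 1 << 0


@dataclass
class Frame:
    owner: str          # domain owning this machine frame, e.g. "dom0" or "domU"
    type_info: int = 0  # flag word, loosely modelled on page->u.inuse.type_info


class ToyHypervisor:
    """Toy model in which every guest page-table update is mediated."""

    def __init__(self, frames):
        self.frames = frames         # frame number -> Frame
        self.page_table = {}         # virtual page -> frame number
        self.protected_ptes = set()  # virtual pages whose PTEs are locked

    def protect(self, vpage, frame_no):
        """Install a mapping, then lock both the PTE and the target frame."""
        self.page_table[vpage] = frame_no
        self.protected_ptes.add(vpage)
        self.frames[frame_no].type_info |= PGT_ENTRY_PROTECTED

    def update_pte(self, domain, vpage, frame_no):
        """Return True if the update is applied, False if it is refused."""
        frame = self.frames[frame_no]
        if vpage in self.protected_ptes:
            return False  # second attack: rewriting a protected PTE
        if frame.owner != domain:
            return False  # first attack: mapping a frame owned by another domain
        if frame.type_info & PGT_ENTRY_PROTECTED:
            return False  # first attack: aliasing a protected executable page
        self.page_table[vpage] = frame_no
        return True


frames = {0: Frame("dom0"), 1: Frame("domU"), 2: Frame("domU")}
hv = ToyHypervisor(frames)
hv.protect(vpage=0x10, frame_no=0)  # shared memory page owned by dom0
hv.protect(vpage=0x20, frame_no=1)  # application executable page
```

In this model an ordinary update to an unprotected page still succeeds, which reflects the design goal that the guest kernel keeps managing the rest of its own page tables.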

Attack attempting to tamper with the protected monitor memory content. This is defeated because all reads, writes, and executes of bytes within the protected monitor’s memory region are blocked by the hypervisor via the MMU. This means no software running within the VM can read, write, modify, or arbitrarily execute protected monitor code, irrespective of the CPU privilege level. Recall there is a special entry page (“jump page”) that when executed deprotects the protected pages so that the PM can be invoked from outside the PM. The jump page contains only vectors (jumps) to specific known entry points, and cannot be read or written until execution in it is begun. As a result, the PM code and data cannot be tampered with in any way.

Attack attempting to starve the protected monitor of the CPU. For this, the attacker would somehow prevent any user application from calling in to the PM. Because we are not attempting to entirely control the user VM, this attack will necessarily succeed against the prototype. Note that this neither subverts the PM nor grants access to resources controlled by the PM; it merely means the PM will not execute.

Attack attempting to regain control of the CPU when it is executing inside the protected monitor. A major attack vector is to regain control of the CPU somehow while it is executing inside the protected monitor. The most obvious mechanism for this is scheduling a timer interrupt. We can take care of this by masking interrupts while inside the protected monitor. However, some system management interrupts (e.g., power events) are non-maskable and hence cannot be disabled by disabling interrupts. Thus there are a few intricate low-level attacks that the scheme is susceptible to. In particular, modifying BIOS (Basic Input/Output System, basic PC firmware) or SMI (System Management Interrupt) code could be used to stage an attack. Such attacks require considerable knowledge and skill and are frequently hardware-specific. Note that the attacker cannot regain control by causing VM faults, because Xen mediates all page faults. Moreover, all pages we will access (the 4 megabytes of shared memory and the executable we hash) were paged in and then locked, so there won’t be any faults on them while we run. If there were some way to regain control of the CPU while it was operating inside the monitor, there might be some way to use this to impersonate a user process and retrieve the key belonging to that process.

Attack attempting to impersonate the service caller. This is difficult to do because the backend monitor inspects the binary making the invocation. So in order to impersonate the caller, the attacker must somehow either use the same binary or subvert the hashing process. In the first case, where the attacker somehow convinces the correct binary to disclose a secret, this is an attack against the application itself and is outside our security claim. We expect the second case, where the attacker subverts the hashing process to yield an incorrect result, to be quite difficult for the attacker. Since we perform hashing from outside domain U with the pages having already been forced into memory (preventing page fault handler attacks), the only way we can see to do this is to misrepresent which pages constitute the application in question, which is very difficult since we use the same data structures to determine the application pages as the CPU does when it executes them.

It’s important to note that we hash the executable at the first call gate, and then lock the executable pages so they can’t be modified. This is for two reasons: (i) The performance impact is lower, since there is a hash at the beginning instead of every time a message is sent. (ii) This prevents subtle TOCTOU attacks which would otherwise be possible (e.g., changing the binary just before sending the message, then somehow changing it back afterwards, even by re-infecting the machine).
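The measure-once-then-lock pattern can be sketched as follows. This is an illustrative model only: the class MeasuredExecutable and its method names are hypothetical, and SHA-256 is an assumption made for the example, since this section does not say which hash function the prototype uses.

```python
import hashlib


class MeasuredExecutable:
    """Toy model: pages are hashed once, then locked against modification."""

    def __init__(self, pages):
        self.pages = [bytearray(p) for p in pages]
        self.locked = False
        self.measurement = None

    def first_call_gate(self):
        """Hash every executable page in order, then lock the pages."""
        digest = hashlib.sha256()  # hash choice is an assumption of this sketch
        for page in self.pages:
            digest.update(bytes(page))
        self.measurement = digest.hexdigest()
        self.locked = True
        return self.measurement

    def write(self, page_no, offset, data):
        """Guest write attempt; refused once the pages are locked."""
        if self.locked:
            return False
        self.pages[page_no][offset:offset + len(data)] = data
        return True

    def send_message(self, expected):
        """Later calls reuse the stored measurement instead of rehashing."""
        return self.measurement == expected
```

Because writes are refused after the first call gate, the stored measurement stays valid for every later message, which is why rehashing per message is unnecessary in this model.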

3.6.2  Defeating  Attacks  against  Domain  0  Components 

Intuitively,  the  components  in  domain  0  cannot  be  attacked  from  domain  U  because  domain 
0  is  inaccessible  except  via  our  communication  mechanism.  Nonetheless,  we  analyze  possible 
attacks  in  more  detail  to  ensure  a  correct  analysis: 

Attacks  attempting  to  penetrate  into  domain  0.  There  are  basically  two  ways  an 
attacker  could  do  this: 

•  Subvert  the  Xen  hypervisor.  Successfully  subverting  the  hypervisor  from  an  untrusted 
domain  is  precluded  by  our  assumptions  (see  Section  3.2.1). 

• Exploit some software the user has installed in domain 0 in order to control domain 0 from domain U. Here we must assume the user does not install software in domain 0 that permits domain U to arbitrarily control, access, or modify domain 0. One way to achieve this for most users is simply to make domain 0 difficult to access.

Attack against the domain 0 disk. The disk resides within the accessible space of domain 0. Domain 0 may choose to give domain U access to the disk, but without such explicit provision, domain U can see only the part of the disk that is designated for the use of domain U, if any. Generally hypervisors do not provide any access to the domain 0 disk by default, so our security here depends on the assumptions that the hypervisor has not been configured to allow domain U direct access to the domain 0 disk, and that no services have been installed in domain 0 that give domain U general access to the domain 0 disk. (Indeed, not allowing such access is the default and typical configuration for Xen, the hypervisor we chose.)

Attacks attempting to falsely request digital signatures. There’s no way for an attacker in domain U to falsely request a digital signature from domain 0. This is because falsely requesting a digital signature from domain U means either:

•  pretending  to  be  a  different  application  or  an  uncompromised  one  (see  Section  3.1). 

•  attacking  the  communication  mechanism  (see  Section  3.3). 

3.6.3  Defeating  Non-Domain-Specific  Attacks 

Here  we  analyze  attacks  against  components  that  are  not  contained  within  a  specific  domain. 

Attacks against the inter-domain communication. The attacker has the following options to attack inter-domain communication:

•  Attacks  using  the  kernel  to  block  communication 

Our system was very carefully designed to make the communication process function without any reliance on the kernel, so that we are not subject to this attack. The kernel can deny the CPU to an application, but this results only in a denial-of-service attack.

•  Attacks  against  the  application  to  block  communication 

—  The  attacker  can’t  attack  the  application  binary  because  it’s  protected  by  memory 
protection  once  communication  is  set  up. 

— The attacker can’t attack memory pages with the communication data in them because they are protected from domain U access by anyone but the application (and the application can’t be modified).

— This leaves the possibility that the attacker can somehow disrupt communication by attacking internal data structures of the application. This depends on the quality of the implementation itself and is outside our scope: we do not and cannot attempt to protect the application from its own design and implementation.

• Attacks against the communication mechanism itself


— Attacker can’t modify call gate. The call gate is set up in the Global Descriptor Table (GDT), which by design in Xen can’t be modified by domain U.

— Attacker can’t attack communication in application. The cases and analysis from “attack application to block communication” above apply here, with the same result.

—  Attacker  can’t  attack  communication  in  kernel.  As  noted  above,  our  design 
excludes  the  kernel  from  the  communication  process,  so  there  is  nothing  here  to 
attack. 

—  Attacker  can’t  attack  communication  in  hypervisor.  Attacking  the  Xen 
hypervisor  from  within  a  guest  domain  is  precluded  by  our  standard  assumptions. 

Attacks committed using virtualization. Some attacks use virtualization in some way to escalate an attacker to hypervisor privilege and hide a malware hypervisor from the operating system. Hardware virtualization technology attacks like Blue Pill [59] are not possible because they require executing virtualization instructions at ring 0 privilege, but Xen only allows domain U to run at ring 1 and higher. Similarly, SubVirt [42], which relies on adding a hypervisor early in the machine boot sequence, is not possible because the attacker is contained within domain U, and nested hypervisors are not supported anyway.

3.7  Experimental  Evaluation  of  Performance 

In  order  to  measure  the  performance  of  our  system,  we  consider  two  aspects.  First,  we 
examine  the  time  required  to  send  a  message  using  our  protected  monitor  mechanism.  This 
message-sending  mechanism  is  the  substrate  on  which  the  signing  system  is  built,  and  on 
which  any  other  application  of  the  protected  monitor  would  be  built.  Second,  we  examine  the 
total time required to actually make a signature using our full system. We use the performance of an ordinary signing application linked directly against the ordinary cryptography library as a baseline for comparison.

Our  experimental  setup  is  as  follows.  All  experiments  were  performed  on  an  HP  xw4550 
workstation,  with  a  quad-core  2.3  GHz  AMD  Opteron  processor  and  4  gigabytes  of  RAM. 
The machine has a v1.2 Broadcom TPM, revision level A2. The software environment for
all  experiments  was  paravirtualized  Xen  3.3.1  installed  on  Ubuntu  8.04  LTS  with  a  2.6.24 
Linux  kernel.  The  guest  VM  used  the  2.6.18  paravirtualized  kernel  that  is  provided  with 
Xen  3.3. 

3.7.1 Microbenchmark Performance of Inter-VM Communication


[Figure 3.6 plot: “Total Round-Trip Time for Varying Size Messages”; x-axis: message size.]

Figure 3.6: Total time required for message creation and processing for large round-trip messages. Merely sending the message both ways takes only 13.3 microseconds independent of message size.

The time required to actually send and receive any message, including the two domain transitions that entails, is a mere 13.3 microseconds, as measured in a simple performance experiment that sent 1 million messages. This included the time to hash the executable in domain U once. Note that strictly speaking message send time is independent of message size, because the memory pages containing the message are shared by both parties. In practice, however, this constant send time experiment assumes that the client and server each want to send the same message to each other over and over (i.e., they only write the message to RAM once), and don’t bother to read it. Thus we decided we should also create a microbenchmark where each side reads and writes the message it sends each time, because that time is more significant than the message send time.

Figure 3.6 shows total processing time required for simple large messages (i.e., the client and server read and write each message each time but do not perform any significant computation between reading and writing the messages). This includes the time to create a message in domU, send it to dom0, read it and write a reply message in dom0, send it back to domU, and read it in domU. Time is measured from when the executable is invoked, through its sending of 100,000 messages, until execution returns to the calling script. Each data point is averaged over 10 runs. Recall that merely sending the message from domU to dom0 and back again requires only 13.3 microseconds; thus the bulk of the time is spent reading and writing the message in the buffer. The figure also includes the cost for hashing the domain U application once per invocation.

Note that for sizes up to 500 kilobytes, the message send time is directly proportional to the size. Beyond 500 kilobytes the send time continues to grow linearly with message size, but each additional byte costs less; i.e., the constant factor is smaller after 500 kilobytes. The reason for this effect is unknown; it may be due to some efficiency in the memory system which kicks in when very long sequential memory accesses occur (e.g., some sort of optimization for large memory transfers to the cache). Note that even a 1 megabyte message takes only 3.2 milliseconds to both send once and then send back in reply, meaning we achieve a little over 312 megabytes per second round-trip. Assuming each direction has equal cost, we can roughly estimate that our communication channel can send over 600 megabytes per second in either direction.
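The throughput estimate can be reproduced with a few lines of arithmetic. The sketch below takes “1 megabyte” to mean 10^6 bytes, which is an assumption of this example; the variable names are illustrative only.

```python
MESSAGE_BYTES = 1_000_000    # "1 megabyte", assumed here to be 10**6 bytes
ROUND_TRIP_SECONDS = 3.2e-3  # round-trip time for that message, from the text

# Round-trip rate: one message's worth of data per complete round trip.
round_trip_mb_per_s = MESSAGE_BYTES / ROUND_TRIP_SECONDS / 1e6

# If both directions cost the same, each direction takes half the time,
# so the one-way rate is twice the round-trip rate.
one_way_mb_per_s = 2 * round_trip_mb_per_s

print(f"round trip: {round_trip_mb_per_s:.1f} MB/s")
print(f"one way:    {one_way_mb_per_s:.1f} MB/s")
```

Under these assumptions the script reports 312.5 MB/s round-trip and 625.0 MB/s one-way, consistent with the “a little over 312” and “over 600” figures quoted above.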

The smallest messages are 100 bytes, 1 kilobyte, and 10 kilobytes, but these are hardly visible in the linear-scale graph. We examine a log-log graph of the same data (Figure 3.7) in order to see them better. Here we see that as message sizes decrease below 100 kilobytes the curve seems to flatten out. Considering this, the fact that 100 byte messages take 14.7 microseconds, and the fact that sending an empty message takes 13.3 microseconds, we believe that for such small messages the fixed send time per message is dominating the memory access time. Only as messages increase from 10 kilobytes to 100 kilobytes does the time to actually access the memory begin to dominate the fixed per-message overhead. We suspect this means that the effective speed of large messages is limited primarily by the memory bandwidth of the CPU and memory subsystem.
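The claim that fixed overhead dominates for small messages can be checked directly from the reported numbers. The helper below is a hypothetical illustration; it uses only the 13.3 microsecond, 14.7 microsecond, and 3.2 millisecond measurements quoted in this section.

```python
EMPTY_MESSAGE_US = 13.3     # fixed round-trip cost (empty message), from the text
HUNDRED_BYTE_US = 14.7      # round trip for a 100-byte message, from the text
ONE_MEGABYTE_US = 3200.0    # round trip for a 1-megabyte message, from the text


def fixed_overhead_share(total_us, fixed_us=EMPTY_MESSAGE_US):
    """Fraction of a round trip spent on the fixed per-message cost."""
    return fixed_us / total_us


share_small = fixed_overhead_share(HUNDRED_BYTE_US)  # roughly 0.90
share_large = fixed_overhead_share(ONE_MEGABYTE_US)  # roughly 0.004
```

For the 100-byte message about nine tenths of the time is fixed overhead, while for the 1-megabyte message the fixed cost is well under one percent, matching the flattening seen at the left of the log-log curve.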


[Figure 3.7 plot: “Total Round-Trip Time for Varying Size Messages”, log-log axes.]

Figure 3.7: LOG-LOG graph of total time required for message creation and processing for large round-trip messages. Note how the curve flattens out towards the left, showing that for small messages the fixed time per message dominates the size-dependent time that matters for larger messages.


3.7.2  Assured  Signing  Performance 

Figure  3.8  shows  the  performance  of  our  system  when  creating  signatures  of  varying  sizes. 
We  run  the  system  to  create  a  single  signature  at  a  time,  including  hashing  the  domain  U 
executable  each  time.  Each  point  in  the  graph  is  averaged  over  100  runs. 

[Figure 3.8 plots: “Total Time for Varying Data Sizes”. (a) Execution time for system with TPM, without using TPM, and a baseline. (b) Execution time for system without using TPM, and a baseline.]

Figure 3.8: Time required to produce and verify signatures of varying sizes for different variations of the system and a baseline of the ordinary cryptography library running inside the same domain. The second graph removes the top curve (full system with TPM) to make the two lower curves more visible.

Figure 3.8(a) shows three curves. The curve labeled “Plain crypto library” is execution time when creating a signature directly in domain 0 by directly invoking the cryptography library. This is shown only for comparison purposes. The curve labeled “System without TPM” is execution time when creating a signature in domain 0 from a domain U request using our system without using the TPM, which means we retain all the security guarantees except remote verifiability. The curve labeled “Full system” is execution time of our full secure system creating a signature in domain 0 from a domain U request, including generating values for TPM remote verification. Figure 3.8(b) focuses on the two lower curves so they can be seen clearly: our system without TPM, and directly invoking the normal cryptography library.

From  Figure  3.8  we  make  three  observations:  First,  using  the  TPM  slows  down  the  system 
substantially.  TPM  operations  are  often  slow  and  even  individual  operations  take  up  to  1 
second,  as  described  in  [49].  Second,  run  time  does  increase  as  the  size  of  the  message  being 
signed increases, but not by much. Third, without the TPM our system is around thirty milliseconds slower than the baseline. We believe this performance impact is acceptable for
systems  that  are  not  performing  signatures  continuously,  especially  for  interactive  systems 
that  only  perform  signatures  on  user  demand. 

Figure 3.9: LOG-LOG graph of time required to produce and verify signatures of varying sizes for system with TPM, without using TPM, and a baseline. Although the logarithmic scale reduces the differences between the curve magnitudes, it allows us to compare the shape of the curves.

Notes on execution time. First, our prototype was designed primarily for simplicity of implementation, since it directly maps each call on the cryptlib API to a call to the secure domain. Coalescing these calls together using an intelligent communication layer would allow a significant reduction in the number of domain transitions. Of course, simply redesigning the API could easily give an API that sends as little as one message. However, redesigning the API would break transparency with existing clients. Second, note that in these tests the data to be signed does get copied once during the signing process, but this copy is required by the design of cryptlib. cryptlib takes a pointer to the data; its API provides no way to say where the data could be put to do zero-copy processing. If it did, we could simply have the client place the data directly in the shared buffer.
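One way to picture the coalescing idea is a stub that queues calls whose results are not needed immediately and sends them together with the next call that does need a result. This is a speculative sketch, not part of the prototype: DomainChannel, DirectStub, CoalescingStub, and the call names used with them are hypothetical and are not cryptlib functions.

```python
class DomainChannel:
    """Counts domain transitions; each round trip carries a list of calls."""

    def __init__(self):
        self.transitions = 0

    def round_trip(self, calls):
        """Deliver (name, args) calls in one message and return their results."""
        self.transitions += 1
        return [f"{name}:ok" for name, _args in calls]


class DirectStub:
    """Prototype-like behaviour: one round trip per API call."""

    def __init__(self, channel):
        self.channel = channel

    def call(self, name, *args):
        return self.channel.round_trip([(name, args)])[0]


class CoalescingStub:
    """Queues deferrable calls and flushes them with the next blocking call."""

    def __init__(self, channel):
        self.channel = channel
        self.pending = []

    def call_deferred(self, name, *args):
        """Record a call whose result the caller does not need right away."""
        self.pending.append((name, args))

    def call(self, name, *args):
        """Send all queued calls plus this one in a single message."""
        batch = self.pending + [(name, args)]
        self.pending = []
        return self.channel.round_trip(batch)[-1]
```

With four calls, the direct stub makes four transitions while the coalescing stub makes one. The design only preserves the client-visible API when the deferred calls’ return values can safely be reported late, which is an assumption that would need checking call by call.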


3.8 Discussion: Protected Monitor Generality and Contribution

Section 3.3 introduced the protected monitor as part of a system that provided higher-security digital signatures. While we introduced the mechanism along with an example application to facilitate explanation, the protected monitor is far more general than that particular application.

Traditional OS security relies on the kernel as the root of trust for securing all applications. Since all security is derived transitively from the security of the kernel, the security of applications (including antivirus programs, Internet security suites, and cryptographic software) is dependent on the security of the kernel. However, attackers are frequently able to subvert the kernel protection, such as with rootkits. Additionally, the kernel implements almost all access control on a per-user level. While this is useful for separating the resources of one user from another, the user-based security model provides little protection in the case of malware that runs as the user who is being attacked.

Thus  we  created  the  “protected  monitor”,  a  powerful  software-based  root  of  trust  that 
cannot  be  easily  subverted  by  malware.  This  gives  a  platform  for  providing  a  variety  of 
secure  services  that  is  less  privileged  than  the  hypervisor  but  is  more  privileged  than  the 
guest  OS  kernel  while  still  allowing  interaction  with  the  guest  OS  and  applications. 

We  believe  there  are  a  variety  of  suitable  applications  beyond  the  assured  digital  signature 
service  provider  developed  earlier  in  this  chapter: 

•  a  mechanism  for  providing  transparent  protection  for  critical  user  secrets,  as  described 
further  in  Future  Work  (Section  5.2).  The  protected  monitor  allows  us  to  implement 
this  without  requiring  any  changes  to  the  OS  or  application  or  libraries  in  the  untrusted 
VM.  Note  loading  the  kernel  module  does  remain  necessary. 

• a protected cryptographic service library, also described further in Future Work (Section 5.2).

• a host-based intrusion protection and containment system with intelligent monitoring for malware behaviors, such as process self-hiding and executable packing (e.g., decrypting or decompressing executables on the fly) and unusual network activity. We anticipate the ability to use provenance of the executable and secure user interaction to determine how to react to processes exhibiting suspicious behavior. By inserting kernel hooks we can easily control whether the kernel allows suspicious processes to access various resources without having to actually modify the kernel.

• securing traditional anti-virus and anti-malware software by building its implementation on the protected monitor, allowing it to be protected from attacks by the memory protection and secured storage available by interacting with the trusted VM while still interacting with the operating system in its traditional way. Additionally, direct access to some resources of the untrusted VM, such as the disk and network, can be provided via the trusted VM, so that malicious software cannot interfere with the access in any way.

Comparison to related work. VM introspection has become an important security mechanism. The initial idea [18, 27] was to exploit the hypervisor for isolating intrusion detection systems (IDS) from the systems they monitor, but it was later extended by numerous studies. For example, one can insert traps into the monitored VM so as to capture certain events [4], where the monitor code executes either in the hypervisor or in a trusted VM. This is different from our solution because the protected monitor resides in the user VM, which allows it to avoid much of the semantic gap in VM introspection. In Section 3.2.3 we discussed why our in-VM protected monitor is much more powerful than VM introspection.

In  many  ways  the  work  that  is  most  related  to  our  protected  monitor  is  Sharif  et  al.’s 
secure  in-VM  monitoring  [62],  which  takes  advantage  of  hardware-supported  virtualization 
to  achieve  better  introspection. 

That work only does virtual machine introspection and monitoring of the untrusted VM; no provision is made for secure communication between user applications and the secure VM. This could be emulated to a limited extent by having the secure VM examine the untrusted VM and try to read application data, but there is no mechanism for it to communicate data back to applications in the untrusted VM, and it also does not allow for synchronous function invocation (applications would need to use something like a shared-memory busywait model). There is no memory protection of the application data and no protection of the application or the communication process from the kernel or other applications. Moreover, their work requires Intel’s hardware support for virtualization (Virtualization Technology, or VT), limiting them to recent Intel CPU’s (presumably their work could be ported to AMD’s similar mechanism), whereas Xen can run on essentially any Intel-compatible CPU (we need only an x86 CPU with PAE support, which was introduced in the mid-1990’s).

3.9  Summary  and  Limitation 

We  present  an  effective  solution  to  malware  attempts  to  compromise  private  signing  keys 
or  to  falsely  request  digital  signatures.  Our  solution  not  only  completely  secures  the  keys 
from  the  malware,  but  also  can  be  used  by  existing  applications  without  any  modification  to 
their  source  code.  We  also  introduce  a  powerful  mechanism  for  securely  providing  services  to 
applications  in  a  VM,  which  we  believe  will  be  of  independent  value.  Finally,  we  demonstrate 
that  our  mechanisms  have  reasonable  performance. 

A limitation of this work and opportunity for future work is determining how to measure the domain U kernel code, without interference from the data structures and runtime patching that cause variation in the contents of the Linux 2.6 kernel code space. This would allow us to describe the state of the domain U kernel as part of our attested signatures, so that a verifier could confirm that the kernel binary was not compromised. One way to do this would be to develop a comprehensive list of parts of the kernel that can change, and simply omit all of those when measuring. The challenge would be identifying these bytes in a way that is robust to changes in the kernel caused by continuing kernel development.
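The “omit what can change” approach could look roughly like the following. This is a sketch of the proposed future work, not an implemented component: the function name, the byte-range representation of mutable regions, and the use of SHA-256 are all assumptions made for illustration.

```python
import hashlib


def measure_with_exclusions(image, mutable_ranges):
    """Hash `image`, skipping every [start, end) byte range that may change."""
    digest = hashlib.sha256()  # hash choice is an assumption of this sketch
    position = 0
    for start, end in sorted(mutable_ranges):
        if start < position or end > len(image) or start > end:
            raise ValueError("ranges must be in bounds and non-overlapping")
        digest.update(image[position:start])
        # Bind each hole's location and length into the measurement so that
        # the set of excluded bytes cannot be changed without detection.
        digest.update(f"skip:{start}:{end}".encode())
        position = end
    digest.update(image[position:])
    return digest.hexdigest()
```

Two images that differ only inside the excluded ranges then yield the same measurement, while any change outside them, or any change to the exclusion list itself, yields a different one. Producing a correct exclusion list for a real kernel is the hard part noted above and is not addressed by this sketch.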

Chapter 4

RELATED  WORK 


4.1  Overview 

Here  we  examine  the  existing  work  that  is  most  closely  related  to  protecting  cryptographic 
keys  and  cryptographic  functions.  A  relevant  recent  survey  can  be  found  in  [55].  Because 
keys  are  a  special  type  of  secret  data,  we  also  consider  certain  work  that  deals  with  securing 
secrets.  We  organize  the  related  work  into  three  categories: 

•  Protecting  cryptographic  keys  (Section  4.2) 

•  Protecting  cryptographic  functions  (Section  4.3) 

•  Protecting  cryptographic  keys  and  functions  (Section  4.4) 


4.2  Protecting  Cryptographic  Keys 

4.2.1  Protecting  Keys  with  Special  Hardware 

The most straightforward method to protect cryptographic keys and other secrets is to utilize special hardware devices, such as cryptographic co-processors [78] or Trusted Platform Modules [32]. Still, such devices may be no panacea because they introduce hardware-related risks such as side-channel attacks [43]. Moreover, many systems do not have or support such devices.

4.2.1.1 Trusted Platform Module


The vast majority of hardware solutions proposed for securing critical secrets rely on the Trusted Platform Module, or TPM, proposed by the Trusted Computing Group [32]. We note that our work has several advantages over a typical TPM-based system:

•  Our  system  does  not  require  special  hardware,  unlike  TPM,  although  we  can  leverage 
a  TPM  to  provide  additional  assurance  to  a  remote  verifier  of  signatures. 

• We provide better performance, partly because the TPM is frequently handicapped by its LPC bus, which was chosen to keep its cost down.

•  Our  system  can  also  be  upgraded,  whereas  the  TPM  design  deliberately  precludes 
upgrades. 

• Our protected monitor platform (Chapter 3) does not fundamentally depend on the integrity of the kernel, whereas functionality of the TPM software stack does depend on kernel integrity (for example, it depends on the TPM device driver).

• Our platform’s capabilities are much more general than what the TPM directly supports. For example, our protected monitor can execute arbitrary code, including calling into the operating system kernel.

• Significantly, we believe we are not subject to various kinds of binary-replacement attacks that apply to typical software checksumming (note the TPM has to rely on software to compute the hash of a binary; due to its low-bandwidth bus it could not perform hardware checksumming even if it were part of the design). Oorschot capably lays out several such attacks in [72] and [68]. Essentially, these attacks defeat “self-hashing” code by utilizing “operating system level manipulation of processor memory management hardware” on compromised kernels. Since the hypervisor is at a higher level of abstraction (and in fact is often responsible for managing the illusion of direct access to that memory management hardware), it is not subject to such attacks. In fact, our external verifier is essentially completely isolated by VM isolation.

Caveat:  This  imperviousness  to  some  checksumming  attacks  does  not  come  entirely  for 
free;  virtual-machine  introspection  has  to  rely  on  kernel-level  data  structures  in  the  VM  in 
order  to  establish  the  pages  that  constitute  the  code  for  a  given  process,  for  example.  The 
technical  report  [65]  studies  implications  of  the  reliance  of  virtual-machine  introspection  tools 
on  the  integrity  of  kernel  data  structures,  concluding  that  efficacy  of  VM-based  introspection 
typically  still  relies  on  data  structures  the  VM  can  manipulate,  and  gives  examples  of  attacks. 
According  to  their  report,  they  demonstrate  their  attacks  can  still  “undetectably  hide  a  kernel 
module,  hide  a  running  process,  and  add  Trojan  versions  of  critical  software.”1  However, 
they  also  develop  a  tool  that  can  still  perform  some  monitoring  without  being  subject  to 
such  attacks. 

We  also  can  provide  secure  auditing  for  cryptographic  operations,  storing  logs  in  an 
inaccessible  protection  domain,  where  they  cannot  be  tampered  with  or  destroyed  from  the 
insecure  domain.  This  greatly  reduces  vulnerability  to  denial-of-service  attacks  on  logging. 

A time-of-check-to-time-of-use (TOCTOU) attack on a TPM system is demonstrated in [11]: the
application binary is modified after the TPM computes the hash but before the binary is
executed.
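
The window between measurement and execution can be illustrated with a toy model. The sketch below is not the attack of [11]; it only assumes that a SHA-256 digest over a file stands in for the TPM measurement, and the helper names `measure` and `launch` and the file name `app.bin` are hypothetical.

```python
import hashlib
import os
import tempfile


def measure(path):
    """Return the SHA-256 digest of the file (stand-in for a TPM measurement)."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()


def launch(path):
    """Stand-in for execution: return the bytes that would actually run."""
    with open(path, "rb") as f:
        return f.read()


def demo():
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "app.bin")  # hypothetical application binary
        with open(path, "wb") as f:
            f.write(b"benign code")

        recorded = measure(path)           # time of check

        with open(path, "wb") as f:        # attacker acts inside the window
            f.write(b"malicious code")

        executed = launch(path)            # time of use
        return recorded, hashlib.sha256(executed).hexdigest()


recorded, executed_digest = demo()
print("measurement describes what ran:", recorded == executed_digest)
```

The recorded measurement describes the benign bytes, while the bytes that run are the modified ones, so a verifier that trusts the measurement is misled.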

4.2.1.2 Other Hardware Solutions

“Architecture for Protecting Critical Secrets in Microprocessors” [45] proposes an elaborate
and thorough “secret-protected” hardware architecture to protect against software and DMA
attacks. The work is impressive and complete, with features such as cryptographic keys that
follow their users between devices, rather than being tied to particular devices. However, it
is highly complex in addition to requiring changes to the CPU and operating system, and
we suspect it is thus unlikely to be used in practice.

1 Although we make no attempt to demonstrate it, we believe such attacks can be defeated, at least in
principle. A powerful way to defeat these attacks would be by actually running the code in question from
the hypervisor, since when the code is running it must provide the actual version of itself to be executed.

[64]  proposes  managing  security  at  the  level  of  memory  regions  rather  than  only  at  the 
level  of  processes,  giving  a  finer  level  of  granularity  and  simplifying  shared  access  to  secret 
data  in  memory.  It  proposes  small  CPU  hardware  changes  to  make  this  more  efficient, 
such  as  having  a  hardware  cache  in  the  CPU  for  the  memory  access  descriptors,  and  uses 
encryption  for  confidentiality  of  data,  code,  and  security  descriptors. 

[26] assumes that computers will largely adopt non-volatile RAM due to potential advantages
such as lower power consumption and “instant on” starts. This leads to a new
security risk: adversaries reading the RAM of a powered-down system. The authors solve
this problem by introducing a small Memory Encryption Control Unit, or MECU, between
the CPU cache and RAM, so that all data stored in actual RAM will be encrypted. Using
AES (Advanced Encryption Standard) to generate a one-time pad while the memory fetch
is ongoing and then simply XORing with the pad allows the performance hit of encryption
to be minimal. However, the MECU has to generate substantial amounts of pad material
with low latency in order to keep up with the substantial memory bandwidth of modern
CPUs and DMA devices such as graphics cards, so we believe that even in quantity MECU
chips could not be cheap. Additional complexity, or substantial performance hits, come from
maintaining coherency in the tables between the multiple MECUs on a system.
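
The pad-then-XOR idea can be sketched in a few lines. This is only an illustration under stated assumptions, not the design of [26]: we assume the pad for a block is derived from a secret key, the block address, and a per-block counter, and HMAC-SHA-256 stands in for AES because the Python standard library has no AES. The function names and parameters are hypothetical.

```python
import hashlib
import hmac

BLOCK = 16  # bytes per pad block, matching the AES block size


def pad_block(key, address, counter):
    """Derive one pad block from (key, address, counter).

    HMAC-SHA-256 is only a stand-in; the scheme described above uses AES here.
    """
    msg = address.to_bytes(8, "big") + counter.to_bytes(8, "big")
    return hmac.new(key, msg, hashlib.sha256).digest()[:BLOCK]


def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))


def store(key, address, counter, plaintext):
    """Encrypt a block on its way to RAM."""
    return xor(plaintext, pad_block(key, address, counter))


def load(key, address, counter, ciphertext):
    """Decrypt a fetched block. The pad depends only on (key, address, counter),
    not on the data, so it can be computed while the fetch is still in flight."""
    return xor(ciphertext, pad_block(key, address, counter))


key = b"\x01" * 16            # hypothetical secret held by the control unit
line = b"secret key bytes"    # one 16-byte block of data
in_ram = store(key, 0x1000, 7, line)
assert load(key, 0x1000, 7, in_ram) == line
```

Because the pad does not depend on the data being fetched, the only work left on the critical path after the fetch completes is a single XOR, which is why the scheme can hide most of the encryption latency.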

InfoShield  [63]  enforces  “information  usage  safety”  as  described  in  program  semantics 
by  extending  hardware  with  secure  load  and  store  operations  and  encrypting  sensitive  data 
when  it  is  stored  to  memory.  InfoShield  relies  on  the  semantics  of  the  original  source  code 
to  be  correct,  and  requires  annotation  to  specify  which  data  is  sensitive. 

4.2.2  Protecting  Secrets  with  Virtual  Machines 

Many recent works utilize virtual machines to help secure critical secrets. Perhaps the
most general and relevant of these are the works that use VMMs to encrypt application
pages for confidentiality against any other accessor, including the running operating system.
Overshadow [19] does this with “multi-shadowing,” where a VMM can present the illusion
of multiple versions of a page of physical RAM to a client VM. This allows an application to
quickly access unencrypted versions of a page while ensuring the OS and any other processes
see only the encrypted version of the page. This also encrypts files on disk, because the
data on the page is already encrypted when the OS accesses it for a disk transfer. This
requires modifications to the VMM, in this case VMware Workstation, as well as a shim
that runs when applications first load, but means that the applications themselves and the
Linux kernel can run unmodified. Although technically they do not modify the OS, they do
require applications to use a special loader and shim runtime. We modify neither the OS nor
applications, although we do hook into the OS. Whether their approach could actually
be used in Windows is not clear, since Windows does not easily view resources as memory pages.
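
As a rough illustration of the multi-shadowing idea, the following toy model keeps one frame per page and re-encrypts or decrypts it whenever the kind of accessor changes. It is a sketch under our own assumptions, not Overshadow's implementation: the class `ShadowedPage`, its fields, and the accessor labels are hypothetical, a fixed HMAC-derived keystream stands in for the real cipher, and integrity protection is omitted entirely.

```python
import hashlib
import hmac


class ShadowedPage:
    """Toy model of one page that a VMM shows differently to different accessors."""

    def __init__(self, owner, data, key):
        self.owner = owner
        self._key = key
        self._frame = data       # what currently sits in the physical frame
        self._encrypted = False
        self.transitions = 0     # number of encrypt/decrypt operations so far

    def _keystream(self, n):
        out = b""
        counter = 0
        while len(out) < n:
            block = hmac.new(self._key, counter.to_bytes(4, "big"), hashlib.sha256)
            out += block.digest()
            counter += 1
        return out[:n]

    def _toggle(self):
        ks = self._keystream(len(self._frame))
        self._frame = bytes(a ^ b for a, b in zip(self._frame, ks))
        self._encrypted = not self._encrypted
        self.transitions += 1

    def read(self, accessor):
        # The owning application sees plaintext; everyone else sees ciphertext.
        want_encrypted = accessor != self.owner
        if want_encrypted != self._encrypted:
            self._toggle()
        return self._frame


data = b"card number 1234"
page = ShadowedPage("app", data, key=b"\x02" * 16)
app_view = page.read("app")
kernel_view = page.read("kernel")
app_view_again = page.read("app")
```

In this model every alternation between the application and the kernel costs one cryptographic pass over the page, and the frame holds ciphertext only while the most recent accessor was not the owning application.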

Since Overshadow is one of the most closely related non-hardware solutions to our work,
we examine some additional points of comparison. We believe that our solution is rather
more flexible than Overshadow. For example, Overshadow’s design appears to require that
protection domains be completely isolated from one another; there is no provision for protecting
information other than enclosing it within a protection (encryption/integrity) domain.
So if an application needs to be able to access secured data files belonging to another application,
the two applications must be in the same protection domain. By contrast, we not
only can allow multiple applications to access the same protected data if desired, but we
support policies which can be used to specify in detail what data files are shared and how.
It is not clear to us whether Overshadow requires all data on the system to be within some
protection domain; if so, we speculate that many existing applications would be difficult to
use without putting all of them in the same protection domain, which would greatly reduce
the security added. Additionally, this would mean Overshadow does not allow even the
sharing of unprotected data, since there would be no unprotected data.

The performance impact of Overshadow can be substantial, because of the CPU impact
of decrypting or encrypting a page whenever access alternates between the application and
the operating system. This is more visible in some contexts than others. For example,
a UNIX fork microbenchmark under Overshadow performs at only 20% of its native
performance. Actual applications performed no slower than 80% of native performance when only
anonymous pages were encrypted. When all pages and files were encrypted, performance
was lower; in particular, Apache’s throughput was less than 50% of its throughput
without Overshadow. We do not believe our CPU impact and total performance
impact are as significant.

Additionally, Overshadow provides only moderate protection against physical RAM disclosure
attacks, because pages in physical RAM are encrypted only if the page’s last accessor
was the operating system. However, we expect such attacks to be difficult for malware compared
to attacks disclosing the virtual memory space. Overshadow might foil attacks against
the virtual memory space, depending on who accessed the pages last and whether the attack
comes through the kernel, which would cause Overshadow to encrypt the pages.

[77] is a similar work published concurrently that uses a similar technique on the Xen
hypervisor, using manipulation of the VM’s TLB to provide access to encrypted and unencrypted
versions of page frames. We expect this work would compare much the same against
ours as Overshadow.

Vault [44] uses virtual machines to isolate the use of critical secrets from the user’s ordinary
operating system. Whenever users need to use a critical secret to authenticate themselves,
they use a special non-interceptable UI command (e.g., CTRL-ALT-Delete) to switch to the
VMM and then switch to a secure VM. The critical secret is input there and appropriately
transmitted, e.g., to a remote Web site that explicitly requests it from the secure VM. The
secure VM relays the authentication success to the ordinary VM when switching back to it.
Unfortunately, this means the user has to learn new behavior and the client software and
server software both have to be modified to support Vault.

4.2.3 Protecting Cryptographic Keys via Conventional Software


We  begin  by  examining  approaches  to  enhance  the  secrecy  of  cryptographic  keys  against 
attacks that may exploit system vulnerabilities. Here we elaborate the basic ideas of
investigations under this approach, assuming that no copies of a key appear in unallocated memory
(see  [21,  34]  for  examples  of  techniques  that  address  this  issue).  Later  in  this  section  we  will 
examine  in  detail  certain  work  on  critical  secrets  that  is  particularly  closely  related  to  our 
work.  Without  loss  of  generality,  suppose  a  cryptographic  key  is  stored  on  a  hard  drive  (or 
memory  stick),  fetched  to  RAM  to  use,  and  occasionally  swapped  to  disk.  Thus,  we  consider 
three  aspects. 

•  Safekeeping  cryptographic  keys  on  disk:  Simply  storing  cryptographic  keys  on  hard 
drives  is  not  a  good  solution.  Once  an  attacker  has  access  to  the  disk  (even  the  raw 
disk)  the  key  can  be  compromised  through  means  such  as  an  entropy-based  method 
[61].  The  usual  defense  is  to  use  a  password  to  encrypt  a  cryptographic  key  while 
on  disk.  However,  an  attacker  can  launch  an  off-line  dictionary  attack  against  the 
password  (Hoover  and  Kausik  [37]  is  an  exception  but  with  limitations).  A  more 
sophisticated  protection  is  to  ensure  “zero”  key  appearances  on  disk  (i.e.,  a  key  never 
appears in its entirety on disk). For example, Canetti et al. [15] exploit an all-or-nothing
transformation to ensure an attacker who has compromised most of the
transformed key bits still cannot recover the key.

•  Safekeeping  cryptographic  keys  when  swapped  to  disk:  The  concept  of  virtual  memory 
means that cryptographic keys in RAM may be swapped to disk. Provos [58] presents
a method to encrypt the swap file for processes with confidential data. (In a different
setting,  Broadwell  et  al.  [12]  investigate  how  to  ship  crash  dumps  to  developers  without 
revealing  users’  sensitive  data.) 

• Safekeeping cryptographic keys in RAM: Ensuring secrecy of cryptographic keys in
RAM turns out to be a difficult problem, even if the adversary may be able to disclose
only a portion of RAM. Recent investigations by Chow et al. [20, 21] show that some best
practices in developing secure software (e.g., clearing sensitive data such as cryptographic
keys promptly after their use, stated years ago by Viega et al. [69, 70]) have
not been widely or effectively enforced. Moreover, Harrison and Xu [34] found that a
key may have many copies appearing in RAM. The present work makes a significant
step beyond [34] by ensuring there are no copies of the key appearing in RAM. As a
side product, our Key-in-Register method in Chapter 2 should defeat the impressive
recent attack of extracting cryptographic keys from DRAM chips when the computers
are inactive or even powered off [33], because a key never appears in its entirety
in RAM. That attack also highlights that it may be necessary to treat RAM as
untrusted, as our work does.
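
To make the entropy-based search mentioned in the first item concrete, the sketch below slides a window over a byte image and flags regions whose byte distribution looks random, as key material tends to. It is a simplified illustration rather than the method of [61]: the window size, step, and threshold are our own assumptions, and the "key" is just pseudo-random bytes derived with SHA-512 so that the example is reproducible.

```python
import hashlib
import math
from collections import Counter


def shannon_entropy(chunk):
    """Entropy of the byte distribution in chunk, in bits per byte."""
    counts = Counter(chunk)
    total = len(chunk)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def high_entropy_offsets(image, window=64, step=16, threshold=5.0):
    """Return the offsets of windows whose entropy exceeds the threshold."""
    return [
        off
        for off in range(0, len(image) - window + 1, step)
        if shannon_entropy(image[off:off + window]) > threshold
    ]


# Low-entropy filler with 128 random-looking bytes hidden at offset 512.
filler = (b"configuration data " * 40)[:512]
fake_key = hashlib.sha512(b"a").digest() + hashlib.sha512(b"b").digest()
image = filler + fake_key + filler

hits = high_entropy_offsets(image)
print("suspicious offsets:", hits)
```

A scan like this needs no knowledge of the key format, which is why storing a key in the clear, whether on disk or in RAM, offers little protection once an attacker can read the medium.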

4.2.3.1 Microsoft Windows Key Protection

As  an  example  of  common  practice  we  look  at  Microsoft  Windows.  Windows  standards 
provide  for  the  use  of  cryptography  via  a  Cryptographic  Service  Provider  [51],  such  as  the  one 
bundled  with  Windows,  and  more  recently  the  Cryptography  API:  Next  Generation  (CNG) 
[50].  It  appears  that  long-lived  private  keys  are  supposed  to  be  isolated  from  application 
processes  (and  hence  presumably  should  not  appear  in  process  RAM)  as  of  Windows  Vista 
and  Windows  Server  2008,  but  not  in  earlier  versions  of  Windows  [52].  (Windows  XP  Service 
Pack 3 does include fips.sys, a kernel-mode cryptographic module compliant with FIPS
140-1 Level 1, which can provide services to other kernel-mode drivers. We found no reason
to  believe  these  operations  are  made  available  to  user-land  applications.) 

4.2.4  Protecting  Keys  Cryptographically 

A completely different approach to protecting cryptographic keys is to mitigate the damage
caused by their compromise. Notable results include the notions of threshold cryptosystems
[23], proactive cryptosystems [53], forward-secure cryptosystems [3, 7, 8], key-insulated
cryptosystems [24], and intrusion-resilient cryptosystems [39]. [75] proposes a model for
understanding digital signature security of credential infrastructures in the presence of key
compromise and proposes engineering techniques to improve it.

Another approach to protecting cryptography against memory disclosure attacks is taken
by [1], which shows that certain cryptosystems are naturally resistant to partial-key-exposure
memory disclosure attacks, in the sense that a large fraction of the key bits can be disclosed
without endangering the secrecy of the actual key. Nevertheless, our experience shows that
memory disclosure attacks, once successful, are likely to expose a cryptographic
key in its entirety when no countermeasures like those presented in this work are taken.

A different approach is taken in our paper [74]. Here we examine the possibility of using
secret sharing to distribute a user’s key amongst the computers of some set of individuals
whom they trust. The implications of this for security and availability of the key are
analyzed mathematically and with simulation. The paper’s focus includes the effects of
some number of the individual computers being compromised and of some number of
them not being available at any given point in time (e.g., powered off).
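
The trade-off between compromise and availability can be seen in a small threshold secret-sharing sketch. We assume here a Shamir-style k-of-n scheme over a prime field purely for illustration; the text above does not state which secret-sharing scheme [74] analyzes, and the function names and parameters are hypothetical.

```python
import secrets

PRIME = 2**521 - 1  # a Mersenne prime, large enough for a 256-bit key


def split(secret, n, k):
    """Split secret into n shares such that any k of them recover it."""
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(k - 1)]

    def f(x):
        acc = 0
        for c in reversed(coeffs):  # Horner evaluation of the polynomial
            acc = (acc * x + c) % PRIME
        return acc

    return [(x, f(x)) for x in range(1, n + 1)]


def combine(shares):
    """Recover the secret by Lagrange interpolation at x = 0."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret


key = int.from_bytes(b"0123456789abcdef0123456789abcdef", "big")
shares = split(key, n=5, k=3)   # five trusted computers, any three suffice
recovered = combine([shares[0], shares[2], shares[4]])
```

With n = 5 and k = 3, the key remains available while up to two computers are powered off, and remains secret while fewer than three are compromised; changing k moves the balance between the two properties.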

All  of  these  techniques  are  orthogonal  to  our  approach,  and  hence  may  be  combined  with 
our  work. 

4.2.5  Protecting  General  Secrets 

XFI [67] is a pure software mechanism that uses binary rewriting together with a binary
verifier to enforce fine-grained memory access control. This provides access control for critical
secrets when stored in RAM, as long as all programs have had their binaries rewritten and verified.
[13] proposes adding small CPU hardware changes to increase the efficiency of XFI, as well
as the efficiency of a related mechanism that enforces control-flow integrity in order to make
it more difficult to hijack program control flow. Tightlip [79] takes an interesting approach
to securing user secrets: when unauthorized applications access files containing user secrets,
a “doppelganger” duplicate process is created, which gets a sanitized version of the bytes
from the file. The doppelganger and the original process run in parallel until one attempts
to communicate some output that is different from the other, at which time a privacy breach
might be occurring, and so a policy decision must be made, e.g., whether to replace the
original with the doppelganger or to allow the output of the original.
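
The doppelganger comparison can be mimicked with ordinary functions. The sketch below is a sequential toy model under our own assumptions, not Tightlip's mechanism, which runs real processes in parallel: a "program" is just a Python function from input bytes to output bytes, and the names `scrub`, `run_with_doppelganger`, `size_reporter`, and `uploader` are hypothetical.

```python
def scrub(data):
    """Replace sensitive bytes with placeholder content of the same length."""
    return b"x" * len(data)


def run_with_doppelganger(program, sensitive):
    """Run program on the real data and on scrubbed data, then compare outputs."""
    original_output = program(sensitive)
    doppelganger_output = program(scrub(sensitive))
    diverged = original_output != doppelganger_output
    return original_output, doppelganger_output, diverged


def size_reporter(data):
    # Output depends only on the length, which scrubbing preserves.
    return str(len(data)).encode()


def uploader(data):
    # Output embeds the data itself, so it depends on the secret.
    return b"POST /upload " + data


secret = b"pin=4921"
_, _, size_diverged = run_with_doppelganger(size_reporter, secret)
_, _, upload_diverged = run_with_doppelganger(uploader, secret)
print("size_reporter diverged:", size_diverged)
print("uploader diverged:", upload_diverged)
```

Only the program whose output actually depends on the secret diverges from its doppelganger, which is the point at which a policy decision is required.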

4.3  Protecting  Cryptographic  Functions 

4.3.1  Protecting  Cryptographic  Functions  with  TPM 

More recent versions of the TPM support a new functionality mode that allows the launch of
highly isolated signed code, when used with a CPU with appropriate support (Intel’s Trusted
Execution Technology (TXT), or AMD’s Secure Virtual Machine technology (SVM), which
are included in many of their recent CPUs). This allows a small piece of Secure Loader
Block (SLB) code to launch in a completely protected environment, including disabling all
other CPU cores and typically DMA as well. Unfortunately, this suffers from a number of
limitations. Only 64 KB of code can be executed at a time in this fashion. This code cannot
have any dependencies on other software in the system, e.g., it cannot call into other pieces
of code. Invocation of the SLB code is frequently too slow to use for many purposes [49], and
moreover there is the impact on system performance of disabling all other CPUs, CPU
cores, and threads of CPU execution (e.g., hyperthreading). Because of the slowness and
the difficulty of interacting with any other code in the system, the TXT/SVM mechanism is
not suitable for hooking into the kernel.

Flicker [48] builds on the TXT/SVM technologies, greatly simplifying the development
of SLB code for an application and providing additional useful functionality like secured
storage between executions of the SLB code. However, in the end it cannot overcome the
fundamental limitations of the technology as designed and implemented in the TPM and
CPU hardware. In particular, even though Flicker could be used to check for the existence
of hooks in a kernel, it could not be used to service those hooks because SLB invocation is
too slow.


4.4  Protecting  Cryptographic  Keys  and  Functions 

4.4.1  Protecting  with  Virtual  Machines 

Sujit  Sanjeev  first  implemented  the  concept  of  a  cryptographic  service  provider  secured  by 
a  VMM,  as  detailed  in  the  master’s  thesis  [60].  The  service  was  implemented  directly  in 
the hypervisor. Since our Assured Digital Signature Service Provider is based on a partial
cryptographic service provider, we note a few of the important differences in our work:

1. Their work offers little if any protection of cryptographic keys. This is because placing
their cryptographic service provider in the hypervisor subjects it to certain limitations.
Most  notably,  there  is  no  facility  for  persisting  data,  so  keys  cannot  be  stored  by  the 
provider,  which  would  be  more  secure;  instead  they  have  to  be  stored  in  the  user  VM. 
This  may  be  why  their  work  contains  no  provision  for  key  management.  Indeed,  it 
appears  that  only  a  single  key  can  be  used,  and  may  even  be  hardwired  into  the  code. 

2. We use the virtual machine monitor Xen, which has excellent performance and security
and is suitable for production use, whereas their hypervisor lguest does not have those
attributes. lguest is a minimal hypervisor designed chiefly for ease of implementation
and modification, where performance and probably security suffer because the chief
aim is simple code. lguest is described by its author as a “toy hypervisor” and is merely
a simple kernel module that multiplexes kernel data structures.

3. We use a production-grade cryptographic implementation, which would be suitable for
actual use in practice. Their work relies on the Linux kernel cryptography implementation,
which is designed only to suffice for expected kernel use, such as IPsec and
dm-crypt.

The Terra paper [28] uses virtual machines to build an impressive edifice on a machine
with a secure coprocessor similar to a TPM. Virtual machines run within a Trusted Virtual
Machine Monitor (TVMM), as one of two types. Open-box VMs can run any operating systems
and software. Closed-box VMs run only software stacks attested by the TVMM (the entire
stack must be attested). Measuring an entire VM requires that an extremely large number of
component combinations match what was certified. Moreover, there is no facility for securely
examining or controlling what is going on within a VM; closed-box VMs are entirely independent
of the open-box VMs within which users could run their own choice of software. Although it is
not emphasized, Terra appears to assume the entire hardware platform is tamper-resistant, not
merely the trusted coprocessor.

The  technique  of  virtual  machine  introspection  [18,  29]  examines  the  contents  of  a  virtual 
machine  from  outside  the  VM.  Compared  to  our  protected  monitor  foundation,  typical  virtual 
machine  introspection  has  the  following  disadvantages: 

1.  The  semantic  gap  problem:  the  virtual  machine  state  is  much  more  easily  interpreted 
from  inside  the  VM’s  context  than  from  outside.  In  other  words,  it’s  very  difficult  to 
piece  together  what’s  going  on  inside  the  VM  from  outside. 

2.  Introspection  cannot  be  used  to  hook  functions,  because  it  provides  only  the  ability  to 
examine  the  state  of  the  VM. 

Even so, virtual machine introspection remains useful: it allows secure, detailed inspection
of the state of a VM in a way that is hard to realize on a physical machine
without additional hardware.

VM introspection has become an important security mechanism. The initial idea [18, 27]
was to exploit hypervisors for isolating intrusion detection systems (IDS) from the systems
they monitor, but it was later extended by numerous studies. For example, one can insert traps
into the monitored VM so as to capture certain events [4], where the monitor code executes
either in the hypervisor or in a trusted VM. This is different from our protected monitor
because our security monitor resides directly in the user VM, meaning it has more power to
bridge the semantic gap in VM introspection (e.g., our security monitor could understand
the semantics of objects like the kernel).

In  many  ways  the  work  that  is  most  related  to  our  protected  monitor  is  Sharif  et  al.’s 
secure  in-VM  monitoring  [62],  which  takes  advantage  of  hardware-supported  virtualization 
to  achieve  better  introspection.  That  work  only  performs  virtual  machine  introspection  and 
monitoring  of  the  untrusted  VM;  no  provision  is  made  for  secure  communication  between 
applications  and  the  secure  VM.  This  could  be  emulated  to  a  limited  extent  by  having  the 
secure  VM  examine  the  untrusted  VM  and  try  to  read  application  data,  but  there  is  no 
mechanism  for  it  to  communicate  data  back  to  applications  in  the  untrusted  VM,  and  it 
also does not allow for synchronous function invocation (applications would need to use
something like a shared-memory busy-wait model). There is no memory protection of the
application data and no protection of the application or the communication process from
the kernel or other applications. Moreover, their work requires Intel’s hardware support
for virtualization (Virtualization Technology, or VT), limiting them to recent Intel CPUs
(presumably their work could be ported to AMD’s similar mechanism), whereas Xen can run
on essentially any Intel-compatible CPU (we need only 386 and higher with PAE support,
which was introduced in the mid-1990s).

Lares [56] extends VM introspection by using Xen’s memory protection to protect hooks
that are placed inside the guest kernel, including placing a small piece of “trampoline” code
inside the guest VM, to which the hooks transfer control in order to communicate back to the
secure VM. There is no functionality for placing or protecting hooks in user-land applications,
nor for communicating with the applications.

Chapter 5

CONCLUSION 


5.1  Summary 

We provide and analyze defenses against malware attacks on cryptographic keys and
cryptographic functions. In particular, this dissertation sets forth two pieces for defending against
these attacks:

1.  Safekeeping  Cryptographic  Keys  from  Memory  Disclosure  Attacks  (Chapter  2)  is  a 
technique  for  using  a  cryptographic  key  without  ever  having  the  key  in  memory.  This 
gives protection against memory disclosure attacks that otherwise could recover
keys, e.g., in the case of Apache on Linux [34]. As an example, we created a prototype
that  modified  RSA  private  key  encryption  in  OpenSSL  to  use  the  technique. 

This technique allows complete protection of keys from memory disclosure attacks,
even for hardware memory disclosure attacks such as those over FireWire [25], while requiring
no special hardware (only resources found in typical CPUs). Because our prototype
runs on a single-core machine, we have to use a RAM scrambling technique to store
the key in that case, so we show that common attacks such as entropy
scanning, signature scanning, and content scanning are infeasible against it.

2. The Assured Digital Signature Service Provider (Chapter 3) allows clients of digital
signatures to have high-confidence and remotely attestable secure digital signatures and
key storage, even in the presence of malware running at elevated privilege levels. Key
storage services are secure against malware and even raw disk access (from within
the VM). Callers are heavily validated and the secure domain can be attested by the
TPM if desired so that remote verifiers can have high confidence in the authenticity
of the signatures. Moreover, the design provides for a smaller TCB for cryptographic
operations, since the cryptography implementation can rely on a smaller and controlled
software stack.

Chapter  3  also  introduces  the  Protected  Monitor,  which  serves  as  a  foundation  for  the 
secured  signature  system,  and  may  also  be  useful  for  many  other  security  applications, 
because  it  provides  a  platform  on  which  secured  services  can  be  built.  It  is  particularly 
well-suited  to  securing  against  malware  attacks,  although  it  can  also  be  used  for  other 
types  of  attacks.  The  monitor’s  architecture  gains  memory  protection  from  a  virtual 
machine  manager  but  still  allows  the  monitor  to  operate  from  within  the  memory 
space  of  the  virtual  machine,  unlike  virtual  machine  introspection.  This  secures  the 
monitor  against  most  attacks  from  the  user  VM  while  still  allowing  services  built  on 
the  platform  to  interact  with  the  kernel. 

5.2  Future  Work 

Here  we  discuss  some  opportunities  for  useful  future  work. 

Integrated  system.  Figure  5.1  depicts  a  possible  system  architecture  that  uses  all  of  the 
pieces  proposed  in  this  dissertation.  Keys  are  never  left  in  memory,  but  are  used  directly 
out  of  registers  (Chapter  2).  The  entire  system  is  built  on  the  protected  monitor  (Chapter 
3).  The  Assured  Digital  Signature  Service  Provider  (Chapter  3)  uses  the  key-in-register 
cryptography  for  its  cryptographic  operations,  and  provides  digital  signature  services  to 
cryptographic  applications  as  well. 

Figure 5.1: An architecture combining all pieces from the chapters. Chapter numbers are
set in parentheses.

Note that we did not build this integrated system for the dissertation since our existing
prototypes cannot be directly combined in this way. This is primarily because Chapter 2’s SSE
Key-in-Register implementation is a modified version of the OpenSSL cryptographic library,
which was unsuitable for the work in Chapter 3. However, we expect that porting the
Key-in-Register mechanism to Peter Gutmann’s cryptlib, the cryptographic library used in Chapter
3, would not be difficult. It is important to note that the two mechanisms are mostly orthogonal
and provide different kinds of protection; the Key-in-Register implementation would protect
against memory disclosure attacks in domain 0. Technically the assured digital signature
service provider protects against memory disclosure attacks in domain U, although that is
not a major goal of the design.

Safekeeping Cryptographic Keys from Memory Disclosure Attacks. Our investigation
in Chapter 2 inspires some interesting open problems, such as the following:

• First, our work focused on showing that we can practically and effectively exploit some
architectural features to safekeep cryptographic keys from memory disclosure attacks.
However, its security rests on a heuristic argument. It would therefore be interesting to
devise a formal model for rigorously reasoning about the security of our method and
similar approaches. This turns out to be non-trivial, partly for the following reason: if
an adversary can figure out the code that is responsible for loading and reassembling
cryptographic keys into the registers, the adversary may still be able to
compromise the cryptographic keys. The question is therefore to what degree
the adversary can reverse-engineer or understand the code in RAM. Intuitively,
this would not be easy, and it is related to the long-standing open problem of code obfuscation,
which was proven to be impossible in general in a very restricted model [6]. However,
it remains open whether we can achieve obfuscation in a less restricted (i.e., more practical)
model.

• Second, because of the limited capacity of the relevant registers, our RSA realization
was not based on the Chinese Remainder Theorem for speeding up modular
exponentiations, but rather on the traditional “square-and-multiply” method. This is
because the private key exponent d itself occupies most or all of the XMM registers. Is
it possible to circumvent this limitation by, for example, designing algorithms in some
fashion similar to [9]?
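
The trade-off in the second item can be illustrated with a minimal sketch. This is our
own illustration, not the Chapter 2 SSE implementation: the function names and the toy key
are assumptions, and the code is neither constant-time nor register-resident. It shows
that square-and-multiply needs only the exponent d as secret state, whereas a CRT-based
private-key operation must hold p, q, d mod (p-1), d mod (q-1), and q^-1 mod p, which
together are larger than d alone.

```python
def square_and_multiply(base, exponent, modulus):
    """Left-to-right binary modular exponentiation.

    One squaring per exponent bit, plus one multiplication per set bit.
    Illustrative only: this version is not constant-time.
    """
    result = 1
    base %= modulus
    for bit in bin(exponent)[2:]:          # most significant bit first
        result = (result * result) % modulus
        if bit == "1":
            result = (result * base) % modulus
    return result


def rsa_private_crt(ciphertext, p, q, d):
    """RSA private-key operation via the Chinese Remainder Theorem.

    Two half-size exponentiations replace one full-size exponentiation,
    at the cost of keeping five secret values instead of one.
    """
    d_p = d % (p - 1)
    d_q = d % (q - 1)
    q_inv = pow(q, -1, p)                  # modular inverse (Python 3.8+)
    m_p = square_and_multiply(ciphertext, d_p, p)
    m_q = square_and_multiply(ciphertext, d_q, q)
    h = (q_inv * (m_p - m_q)) % p          # Garner's recombination step
    return m_q + h * q


# Toy key for demonstration only (hypothetical values, far too small for real use).
p, q = 61, 53
n = p * q
e, d = 17, 2753
message = 65
ciphertext = pow(message, e, n)
```

Both `square_and_multiply(ciphertext, d, n)` and `rsa_private_crt(ciphertext, p, q, d)`
recover `message`; the difference lies in running time and in how much secret material
must be kept out of RAM.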

Protected Monitor. There are two major components that we believe would be especially
useful to build on the protected monitor, and that we plan as future work:

• The VM-Isolated Cryptographic Service Provider would allow clients of cryptographic
services to have high-confidence cryptography and key storage, even in the presence of
malware running at elevated privilege levels. Essentially, this can be done by extending
the crypto implementation used for the assured digital signature service provider into a
general crypto service provider. We will use a flexible policy mechanism to express which
applications may use which keys and cryptographic services, in terms of rules describing
various criteria, including suspicious malware behavior. As with the signature service
provider, key storage services would be secure against malware and even raw disk access
(from within the VM). Callers would be heavily validated (authentication, provenance checking,
and checking for malware behaviors that may indicate the calling application is infected
with malware). Moreover, the design provides for a smaller TCB for cryptographic
operations, since the cryptography implementation can rely on a smaller and controlled
software stack.

• Transparent Critical Secrets Protection would transparently secure critical secrets on
disk from disclosure via malware (such as for identity theft). No modifications would
be required for legacy applications or for the operating system. The persistent storage
is not accessible without authentication and approval, even with raw disk access
(from within the virtual machine). The goal is for files containing secrets to be identified
automatically, so that the user does not have to manually specify files or policies, although
the user may specify policies if desired.

Another  opportunity  for  future  work  is  determining  how  to  measure  the  domain  U  kernel 
code,  without  interference  from  data  structures  and  runtime  patching  that  cause  variation 
in  the  contents  of  the  Linux  2.6  kernel  code  space.  This  would  allow  us  to  describe  the  state 
of the domain U kernel as part of our attested signatures, so that a verifier could confirm
that  the  kernel  binary  was  not  compromised.  One  way  to  do  this  would  be  to  develop  a 
comprehensive  list  of  parts  of  the  kernel  that  can  change,  and  simply  omit  all  of  those  when 
measuring.  The  challenge  would  be  identifying  these  bytes  in  a  way  that  is  robust  to  changes 
in  the  kernel  caused  by  continuing  kernel  development. 
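
The "omit the parts that can change" idea can be sketched as follows. This is only an
illustration under stated assumptions: the function name, the byte ranges, and the choice
of SHA-256 are ours, and producing the list of mutable ranges for a real kernel is exactly
the open problem described above.

```python
import hashlib


def measure_with_exclusions(image, mutable_ranges):
    """Hash `image` while skipping known-mutable regions.

    `mutable_ranges` is a list of half-open (start, end) byte offsets
    (hypothetical; in practice it would come from the list of kernel
    locations that may legitimately change at runtime).
    """
    digest = hashlib.sha256()
    position = 0
    for start, end in sorted(mutable_ranges):
        if start < position or end < start or end > len(image):
            raise ValueError("ranges must be in-bounds and non-overlapping")
        digest.update(image[position:start])
        # Bind the location of each skipped region into the measurement,
        # so that moving or resizing an exclusion changes the result.
        digest.update(start.to_bytes(8, "big") + end.to_bytes(8, "big"))
        position = end
    digest.update(image[position:])
    return digest.hexdigest()


# Stand-in for a kernel code region (hypothetical contents).
baseline = bytes(range(64))
exclusions = [(16, 24)]

# A runtime patch inside the excluded range should not alter the measurement.
patched = bytearray(baseline)
patched[16:24] = b"\xff" * 8
patched = bytes(patched)

# A modification outside the excluded range should alter it.
tampered = bytearray(baseline)
tampered[40] ^= 0x01
tampered = bytes(tampered)
```

Here `baseline` and `patched` yield the same measurement, while `tampered` does not. The
sketch also makes the stated risk visible: any byte an attacker can reach inside an excluded
range is unmeasured, so the exclusion list must be both complete and as narrow as possible.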

Lastly,  might  there  be  some  way  to  integrate  the  cryptographic  service  provider  and 
transparent  critical  secrets  protection  to  provide  something  very  general? 

Assured Digital Signature Service Provider. We would like to develop and implement
a better scheme for attesting the security and isolation of the assured digital signature
service provider without the use of a TPM. This is particularly desirable since our experiments
show that the TPM has a significant performance impact and is not always available.
Achieving this without hardware support is a difficult remote attestation problem,
particularly since we would like it to be possible to verify signatures offline (i.e., without
interaction with the original signing computer). For example, it might be possible to achieve
this by forming networks of machines that attest each other and then make a group signature
attesting the signing machine.

Secure in-VM monitoring [62]. Would there be any additional value if we incorporated
the  secure  in-VM  monitoring  in  [62]  into  the  protected  monitor?  Or  into  the  signature  service 
provider  system  specifically? 
Appendix A

GLOSSARY  OF  ACRONYMS 


BIOS  Basic  Input/Output  System.  Basic  PC  firmware. 

CAPTCHA  Completely  Automated  Public  Turing  test  to  Tell  Computers  and  Humans 
Apart.  A  method  for  establishing  that  requests  are  made  by  an  actual  human  rather 
than  an  automated  process. 

CSP  Cryptographic  Service  Provider. 

GDT Global Descriptor Table. Specifies call gates and other system descriptors.

IDS  Intrusion  Detection  System. 

IDT  Interrupt  Descriptor  Table. 

IPSEC  Internet  Protocol  SECurity.  Specifies  encryption  and  authentication  standards  for 
securing  the  IP  layer  of  the  TCP/IP  stack. 

LKM  Loadable  Kernel  Module. 

LPC  Low  Pin  Count.  Used  to  describe  the  bus  for  the  TPM,  which  was  designed  to  be 
inexpensive  and  thus  have  a  low  pin  count,  rather  than  fast. 

MFN  Machine  Frame  Number.  The  number  of  the  physical  page  frame  (as  seen  from  the 
perspective  of  the  hypervisor  rather  than  the  virtual  machine). 

MMX  MultiMedia  extensions.  An  Intel  standard  for  SIMD  graphics  processing. 
PAE Physical Address Extensions. x86 addressing mode originally designed to allow 32-bit
CPUs to address more than 4 gigabytes of RAM.

PTE  Page  Table  Entry. 

SIMD  Single  Instruction  Multiple  Data. 

SLB  Secure  Loader  Block. 

SMI  System  Management  Interrupt,  a  capability  of  PC  firmware  to  enter  a  highly  privileged 
firmware  mode  known  as  System  Management  Mode. 

SSE  Streaming  SIMD  Extensions. 

SVM  Secure  Virtual  Machine  technology.  AMD’s  secure  late  launch  technology,  competitor 
to  Intel  TXT. 

TCB  Trusted  Computing  Base. 

TCG  Trusted  Computing  Group.  Industry  consortium  that  created  TPM. 

TLB Translation Look-aside Buffer. Essentially a hardware cache for virtual-to-physical
address mappings.

TLS  Transport  Layer  Security.  Successor  to  Secure  Sockets  Layer  (SSL). 

TOCTOU Time-Of-Check Time-Of-Use attack. Type of attack that relies on changing
data  or  privileges  between  the  time  they  are  verified  and  the  time  they  are  used. 

TPM  Trusted  Platform  Module. 

TSS  TCG  Software  Stack.  Software  interface  to  TPM. 

TXT Trusted Execution Technology. Intel’s secure late launch technology, competitor to
AMD’s  SVM  technology. 
VIRQ Virtual IRQ.


VMM  Virtual  Machine  Monitor.  Also  known  as  a  hypervisor. 


XMM  Set  of  multimedia  registers  designed  by  Intel.  The  name  derives  from  spelling  MMX, 
which  was  the  name  of  the  previous  iteration,  backwards. 




Appendix  B 

LIST  OF  AUTHOR’S  PUBLICATIONS 
B.1 Security Publications

1.  Shouhuai  Xu,  Xiaohu  Li,  Paul  Parker,  and  Xueping  Wang.  Exploiting  Trust-Based 
Social  Networks  for  Distributed  Protection  of  Sensitive  Data,  IEEE  Transactions  on 
Information  Forensics  &  Security  (IEEE  TIFS),  accepted. 

2.  T.  Paul  Parker  and  Shouhuai  Xu.  A  Method  for  Safekeeping  Cryptographic  Keys 
from Memory Disclosure Attacks. International Conference on Trusted Systems (INTRUST),
2009, Springer Lecture Notes in Computer Science, vol. 6163, pp. 39-59.

3.  X.  Li,  P.  Parker,  and  S.  Xu.  A  Stochastic  Model  for  Quantitative  Security  Analysis  of 
Networked  Systems.  IEEE  Transactions  on  Dependable  and  Secure  Computing  (IEEE 
TDSC),  accepted  2008,  to  appear  (vol.  8,  no.  1,  January-February  2011,  pp  28-43). 
Signing: Attack-resilience vs. Availability. ASIACCS’08, pp. 325-336.

(CANS 2007), Lecture Notes in Computer Science, Vol. 4856, pp. 228-246, Springer,
2007.